A conventional test and two forms of a stradaptive test were administered
to thousands of simulated subjects by minicomputer. Characteristics of the
three tests using several scoring techniques were investigated while varying
the discriminating power of the items, the lengths of the tests, and the
availability of prior information about the testee's ability level. The
tests were evaluated in terms of their correlations with underlying ability,
the amount of information they provided about ability, and the equiprecision
of measurement they exhibited. Major findings were: 1) scores on the
conventional test correlated progressively less with ability as item
discriminating power was increased beyond a = 1.0; 2) the conventional test
provided increasingly poorer equiprecision of measurement as items became
more discriminating; 3) these undesirable characteristics were not
characteristic of scores on the stradaptive test; 4) the stradaptive test
provided higher score-ability correlations than the conventional test when
item discriminations were high; 5) the stradaptive test provided more
information and better equiprecision of measurement than the conventional
test when test lengths and item discriminations were the same for the two
strategies; 6) the use of valid prior ability estimates by stradaptive
strategies resulted in scores which had better measurement characteristics
than scores derived from a fixed entry point; 7) a Bayesian scoring technique
implemented within the stradaptive testing strategy provided scores with good
measurement characteristics; and 8) further research is necessary to develop
improved flexible termination criteria for the stradaptive test.

A Simulation Study of Stradaptive Ability Testing



Adaptive Testing

In constructing tests to measure mental abilities, the goal is to extract
as much information as possible about an individual's ability level from a
limited set of test items. Until recently, constrained to use paper and pencil
administration techniques, test makers typically built a "peaked" ability test
with all items of a difficulty such that a person with average ability could
answer about half of them correctly. This type of test provided the highest
level of information in the middle range of ability (Lord, 1970). Assuming a
normal distribution of ability, the middle range is where most individuals
were concentrated. Consequently, a perfectly peaked ability test provided the
highest possible average level of information from a fixed set of items.

Unfortunately, for the individual with a high level of ability, a test
peaked at the average ability level provided little information about his or
her ability because all the items were too easy. The resulting test score for
these individuals simply indicated that they were of high ability, but the
test was not sufficiently sensitive to differentiate within the high ability
group. Similarly, for individuals of low ability the test was often unable to
provide meaningful measurements.

As on-line computer systems became widely available, it became possible to
adapt a test to an individual, giving him/her items which were neither too
difficult nor too easy regardless of his/her ability level. The goal of
adaptive, or tailored, testing is to administer to the testee the subset of
available items that will provide the maximum amount of information about his
ability level. If his ability were known a priori, he would be given a set of
items that a person of his ability level would answer about half correctly and
about half incorrectly. Since a testee's ability is not accurately known
before testing begins, the testing procedure must be designed to adapt to the
individual's ability level as it is estimated during the course of testing,
and thereby select the optimal item set for that individual.

A variety of approximation techniques, or strategies of item selection,
have been suggested, ranging from simple two-stage techniques to complex
Bayesian and maximum likelihood techniques (see Weiss, 1974, for a description
and comparison of adaptive testing strategies). This paper reports on the
stratified-adaptive or stradaptive (Weiss, 1973) testing strategy.

The Stradaptive Testing Strategy 

The stradaptive test, as proposed by Weiss (1973), is a computer-based
analogue of Binet's intelligence testing approach. It is based on an item pool
composed of a collection of peaked tests, or strata, ordered by difficulty and
equally spaced along the ability continuum. The best or most discriminating
items are placed at the beginning of each stratum.

Using this strategy, an initial stratum assignment is made on the basis of
some prior information about the testee's ability. Testing begins with the
first item of this stratum. On the basis of the testee's response to each item,
he is branched either up or down (typically by one stratum) to a more or less
difficult item. As in Binet's test, termination occurs when a ceiling stratum
is reached; thus, the number of items administered to different individuals
can vary. The ceiling stratum is one in which the testee answers all items
incorrectly (or only a chance proportion correct if multiple-choice items
are used). Weiss (1973) suggested, as an operational definition of this ceiling
level, the least difficult stratum in which a chance proportion (or less)
of items had been answered correctly after five items from that stratum had
been administered. As with the Binet test, the stradaptive test also has a
basal level, which is defined as the most difficult stratum in which all items
administered are answered correctly.

Scoring of the stradaptive test can result in ten scores which can be used
to estimate ability level and five scores which reflect response consistency.
The consistency scores are designed to reflect aspects of the interaction of
an individual and a given item pool which might reflect the degree of error in
the ability level scores. Examples of stradaptive response records and further
discussion of the logic of this testing strategy can be found in Weiss (1973),
Vale and Weiss (1975), and Vale (1975).

Empirical studies of the stradaptive strategy. To date there have been two
"live testing" or empirical studies of the stradaptive strategy. Vale and Weiss
(1975) administered two forms of the stradaptive test and one conventional test
to college students. All tests were administered by cathode ray terminals (CRTs)
connected to a time-shared computer. The primary goals of this study were 1)
comparison of the stradaptive test and a conventional test with respect to
test-retest stability; 2) comparisons of the ten ability scores with respect to
test-retest stability; and 3) investigation of the utility of the stradaptive
consistency indices in predicting test-retest stabilities.

In the first part of the study, single administration data on the
stradaptive test as originally proposed by Weiss (1973) were collected from 476
subjects and retest data were collected from 170 subjects. Analyses showed that
meaningful comparisons between the test-retest reliabilities of the stradaptive
and conventional strategies were precluded by many inequalities between the two
tests. These included different test lengths, different proportions of items
presented both on test and retest, unequal item discriminations, and the unknown
influence of initial ability estimates on stability. After attempts to
statistically correct for these inequities, no meaningful difference was found
between strategies with respect to test-retest stability.

Intratest comparisons of stradaptive scores were informative, however. The
ten ability scores grouped themselves into four clusters: 1) maximum performance
scores, such as the difficulty of the most difficult item correct; 2) scores
reflecting the difficulties of the next item or stratum the testee would have
been given had the test continued for one more stage; 3) scores derived from
the difficulty of the most difficult stratum in which the proportion correct
was greater than chance responding; and 4) average difficulty scores. The
average difficulty scores had the highest test-retest stabilities, and the scores
derived from difficulties of hypothetical next items and strata had the lowest.

Three of the five consistency scores were evaluated in terms of their utility
as variables moderating test-retest stability. This was done by subgrouping
testees on the basis of initial test consistency scores and comparing the
test-retest stabilities of the groups. Two of the consistency scores (the
standard deviations of difficulties of all items administered and of
all items answered correctly) were quite predictive of test-retest stability,
but the other consistency score was not.

In the second part of the study, a modification of the original stradaptive
testing strategy was evaluated. In computerized stradaptive testing, testees
may respond with a question mark if they do not know the correct answer to
an item and prefer not to guess. In the original version of the stradaptive
test, these "omits" were counted as incorrect, both in terms of branching
decisions and in scoring. To evaluate the impact of not penalizing testees
for honestly admitting they did not know the answer to an item, a modified
version of the stradaptive test was studied. In this version, omissions
were ignored in both scoring and branching; a question mark response
resulted in the administration of the next item in the stratum.

This modified stradaptive test was administered to a group of 113 subjects,
and 79 of them were retested. Results, when compared to those of the
original version of the stradaptive test, showed substantial reductions in
the utility of the consistency scores for predicting stability. The clusters
of ability scores and the ranking of those scores with respect to stability
were the same as found in the original stradaptive test. Analysis of the
question mark responses showed that subjects omitted items that were more
difficult than those which they answered incorrectly. It was concluded,
therefore, that question mark responses in the stradaptive test should be
treated as incorrect responses, rather than being ignored in branching
decisions and in scoring.

Three suggestions were made for future research on the stradaptive test.
They were 1) Monte Carlo investigation of the utility of the initial ability
estimates; 2) Monte Carlo investigation of the utility of different
termination criteria; and 3) development of improved scoring methods.

The other study of the stradaptive strategy (Waters, 1974, 1975)
investigated the modified version of the test in which omissions were ignored.
Tests in this study were also administered by CRTs connected to a time-shared
computer. The design allowed the calculation of both parallel forms reliability
and validity coefficients. Validity was operationalized as the correlation
of scores obtained from the stradaptive test with scores from a conventional
test composed of similar items taken earlier. Waters' study did not include
any analyses of the consistency scores.

The major findings of Waters' study were 1) that the stradaptive strategy
was able to attain parallel forms reliabilities and validities comparable to a
conventional test having twice as many items; 2) that the relative quality of
the scores with respect to reliability and validity was strongly dependent on
the termination criterion used; and 3) that the average difficulty of all
items answered correctly was consistently one of the best ability level scores
in terms of parallel forms reliability and validity.

Waters' suggestions for future research were that future studies
concentrate on criteria for test termination and on three ability scores: the
interpolated stratum difficulty, the average difficulty of all items answered
correctly, and the average difficulty of all items correct at the most
difficult stratum with greater than chance responding.

From the two empirical studies done on the stradaptive testing strategy,
two definite findings have emerged. First, of the scoring methods used, the
average difficulty of all items answered correctly has consistently been the
best with respect to test-retest stability, parallel forms reliability, and
correlation with an external criterion. Second, all scores have higher
stabilities when omitted items are counted as incorrect responses. In addition,
the stradaptive strategy was found to yield higher alternate forms reliability
and validity than the conventional test in one study, and there were some
data suggesting that the consistency scores were predictive of retest stability.

Computer Simulation as a Supplement to Live-Testing Studies 

While empirical studies do yield important findings (indeed, the stability
of scores and the effect of ignoring omitted items could only have been
evaluated in an empirical study), there are some questions that these studies
are ill-equipped to answer. Perhaps the most obvious shortcoming of an
empirical study is the fact that it takes a considerable amount of time to test
a large number of real subjects. Furthermore, real subjects require real items,
and it is very difficult to obtain real items with appropriate characteristics
to evaluate some questions of interest, such as which strategy is better,
conventional or adaptive. In addition, too few live subjects have appropriate
abilities for evaluating tests at extremely high or low ability levels.

But the most restrictive shortcoming of live-testing studies is the fact
that the testee's ability level remains unknown to the psychometrician. This
fact precludes calculation of various indices of how well the ability estimate
derived from the test reflects the testee's true ability level. One important
index of the goodness of a testing procedure, which uses both estimated and true
ability, is Birnbaum's (1968) information function. The information function
adequately reflects the adaptive test's goal of equiprecision of measurement;
equiprecision implies that scores at all ability levels will reflect true ability
with the same degree of precision. On the other hand, correlation coefficients
and reliability indices, which are generally available from empirical studies,
are weighted by the distributional characteristics of the trait within the
examinee group (Sympson, 1975), and therefore do not provide data from which
equiprecision of measurement can be determined.

Given a question which an empirical study is not equipped to investigate,
there are two alternative research approaches: theoretical studies and Monte
Carlo simulation studies. The theoretical study evaluates a strategy by
varying parameters of interest on a purely mathematical basis. It is conceptually
superior to the Monte Carlo simulation study because it eliminates sampling
error. But, due to the complexity of some testing strategies (e.g., the
stradaptive strategy), simplifying assumptions (e.g., perfectly peaked tests or
subtests) are necessary which limit the generalizability of the theoretical
study to the world of real data and people. Furthermore, in a strategy like
stradaptive, the number of calculations necessary in a theoretical study is
prohibitive. Like the theoretical study, the simulation study is based on a
mathematical model rather than live subjects. But rather than being calculated
exactly, test characteristics are estimated via a stochastic process.

Simulation studies are usually run on computers. The computer first
simulates a testee (which usually consists of randomly generating an ability
level) and then it generates a sequence of item responses on the basis of a
mathematical model of how a real subject with the given ability level would
respond to that set of items. Thousands of "subjects" are usually run and the
data are analyzed as real data would be. However, since ability levels are
known, information curves and other statistics which require both ability
levels and ability estimates can be calculated. Like live-testing studies and
unlike theoretical studies, summary statistics on simulation study data are
subject to random fluctuations. But this is not usually a problem if large
numbers of testees are simulated.
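
The two steps just described, generating an ability level and then
generating item responses from a response model, can be sketched as follows.
This is a minimal illustration and not the program used in the study: it
assumes a two-parameter normal ogive model without guessing, and the 40-item
peaked test, the seed, and the function names are hypothetical.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_correct(theta, a, b):
    """Normal ogive probability of a correct response (no guessing assumed)."""
    return _STD_NORMAL.cdf(a * (theta - b))


def simulate_testee(items, rng):
    """Draw one ability from N(0, 1) and a 0/1 response to each (a, b) item."""
    theta = rng.gauss(0.0, 1.0)
    responses = [1 if rng.random() < p_correct(theta, a, b) else 0
                 for a, b in items]
    return theta, responses


rng = random.Random(1975)          # fixed seed so the run is repeatable
items = [(1.0, 0.0)] * 40          # hypothetical 40-item peaked test
records = [simulate_testee(items, rng) for _ in range(2000)]

# Because theta is retained, scores can later be compared with true ability.
mean_prop = sum(sum(r) / len(r) for _, r in records) / len(records)
print(round(mean_prop, 3))
```

With a test peaked at the population mean, the average proportion correct
should fall near .50, subject to the random fluctuation noted above.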

The study reported herein is a simulation study of the stradaptive testing
strategy. In this study, findings from live-testing studies were re-examined
using the simulation technique, and some questions not answerable through
live-testing studies were investigated.

METHOD 

Design 

A simulation study is valuable only to the extent that the underlying
model accurately reflects data from live-testing studies. For this reason, the
initial phase of this study entailed a simulated replication of the empirical
study by Vale & Weiss (1975). Exact replication was not possible, since the
empirical study used test-retest stability as an evaluative criterion.
However, test-retest correlations in a simulation study are, strictly speaking,
parallel forms reliability coefficients (Betz & Weiss, 1975) and are formally
related to the correlation between test scores and the "true" (i.e., generating)
ability. The latter correlation is equivalent to the index of reliability,
and the square of that correlation is equivalent to a parallel forms reliability
coefficient. As in the live-testing study, intercorrelations among the scores
were also calculated. Further analyses not possible in the empirical study,
such as calculation of information functions, were also done on the original
version of the stradaptive testing strategy used in the live-testing study
(here referred to as Variable-Length Stradaptive).
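
The relation between the score-ability correlation and parallel forms
reliability can be checked numerically. The sketch below assumes a simple
classical model in which each parallel form equals true ability plus
independent normal error; the error standard deviation of 0.8, the sample
size, and the helper names are hypothetical choices for illustration.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
theta = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Two parallel forms: same true ability, independent errors of equal variance.
form_a = [t + rng.gauss(0.0, 0.8) for t in theta]
form_b = [t + rng.gauss(0.0, 0.8) for t in theta]

r_score_ability = pearson(form_a, theta)   # index of reliability
r_parallel = pearson(form_a, form_b)       # parallel forms reliability
print(round(r_score_ability ** 2, 3), round(r_parallel, 3))
```

The two printed values should agree up to sampling fluctuation, which is why a
simulation can report score-ability correlations in place of retest
correlations.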

The major aim of the present study was a comparative analysis of the
characteristics of stradaptive test scores and conventional test scores under
varying conditions. The characteristics of greatest interest were 1)
intercorrelations among the scores; 2) correlations between generating ability
and the scores; and 3) information provided about ability by the scores at
various levels of ability.

The conditions varied within the stradaptive test were 1) the scoring
method; 2) the discriminating power of the items; 3) the quality of prior
information available about ability; and 4) the number of items administered.
The discriminating power was varied by using one of three hypothetical
item pools, described below, with item discrimination fixed at a = 0.5, a = 1.0,
or a = 2.0. For the replication of empirical results using Variable-Length
Stradaptive, a fourth pool containing parameters of real items with varying
discrimination was also used. The quality of prior information was varied by
providing the strategy with an initial ability estimate either fixed at
$\theta = 0.0$, or distributed normally with a mean of zero and a standard
deviation of 1.0, and correlating .0, .5, or 1.0 with generating ability. The
extreme correlations were chosen to provide the upper and lower bounds on the
effect of initial ability estimates and the .5 correlation was chosen as a
typical value. The fixed test lengths investigated in this study were 10, 20,
40, and 60 items.
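
One common way to produce a standard normal initial estimate with a chosen
correlation r with generating ability is to mix the ability with independent
noise: estimate = r * theta + sqrt(1 - r^2) * e. The report does not state how
its estimates were generated, so the sketch below is only an assumption about
one workable method; the function names and sample size are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def initial_estimate(theta, r, rng):
    """Unit-variance estimate correlating r with theta, for theta ~ N(0, 1)."""
    return r * theta + (1.0 - r * r) ** 0.5 * rng.gauss(0.0, 1.0)


rng = random.Random(42)
thetas = [rng.gauss(0.0, 1.0) for _ in range(20000)]
observed = {}
for r in (0.0, 0.5, 1.0):
    estimates = [initial_estimate(t, r, rng) for t in thetas]
    observed[r] = pearson(thetas, estimates)
    print(r, round(observed[r], 2))
```

With r = 1.0 the estimate equals the generating ability, and with r = .0 it
carries no information, which corresponds to the upper and lower bounds
described above.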

In order to allow several levels of all conditions to be completely
crossed, a simplified version of the stradaptive strategy (referred to as
Fixed-Length Stradaptive) was adopted. Fixed-Length Stradaptive had a simpler
administration strategy, used fewer scoring methods, and had a fixed
termination criterion (thus allowing test length to be manipulated). One further
analysis, outside of the crossed design, was done on Fixed-Length Stradaptive.
Three potential termination criteria were evaluated on the basis of how well
they correlated with error of measurement. This was done to determine whether
flexible termination of the stradaptive test would be useful in providing
equiprecise measurement.

Tests 

Conventional Test

A conventional test was included for purposes of comparison. Items in this
test were simply administered in a linear order (i.e., with no branching on the
basis of responses) and the test was scored by calculating the proportion of
items answered correctly.

Variable-Length Stradaptive

The logic. This test was identical to the test used by Vale & Weiss (1975)
except for some minor changes in scoring strategies. On the basis of an initial
ability estimate $\theta_I$, the "testee" was given the first item in one of the
nine available strata. The stratum, $S$, from which the first item was
administered was determined by rounding the function $S = \theta_I / .65 + 5$
when $1 \le S \le 9$, and set to the nearest end point when outside that
interval.

If the testee's response to the first item was correct, he was branched
to ("administered" an item from) the next more difficult stratum. If his
response was incorrect, he was branched to the next easier stratum. If there
was not a sufficiently easy or difficult stratum for the required branching
(i.e., when an incorrect response was given to an item in the least difficult
stratum, or a correct response to an item in the most difficult stratum) the
testee was given another item in the same stratum. Testing continued until a
termination criterion was reached. The termination criterion used for
Variable-Length Stradaptive was the identification of a stratum in which 20%
or less of the items were answered correctly after at least five items had
been administered.
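
The entry, branching, and termination rules above can be sketched in Python
as follows. This is an illustration under stated assumptions rather than the
study's program: the entry rule is taken to be a rounding of theta_I / .65 + 5
clamped to strata 1 through 9, responses follow a normal ogive model without
guessing, and the pool (30 items per stratum, all at the stratum center) is
hypothetical. Stopping when a stratum runs out of items is also an assumption,
since the text does not say what happened in that case.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def entry_stratum(theta_init, width=0.65, n_strata=9):
    """Round theta_init / width + 5 and clamp to the available strata."""
    s = int(math.floor(theta_init / width + (n_strata + 1) / 2 + 0.5))
    return min(max(s, 1), n_strata)


def administer(theta, theta_init, pool, rng, a=1.0, min_items=5, chance=0.20):
    """Run one variable-length test; pool[s] lists difficulties in stratum s.

    Returns a list of (stratum, difficulty, response) tuples.
    """
    n_strata = len(pool)
    position = {s: 0 for s in pool}   # index of next unused item per stratum
    given = {s: 0 for s in pool}      # items administered per stratum
    correct = {s: 0 for s in pool}    # items answered correctly per stratum
    s = entry_stratum(theta_init, n_strata=n_strata)
    record = []
    while position[s] < len(pool[s]):  # assumed stop: stratum exhausted
        b = pool[s][position[s]]
        position[s] += 1
        u = 1 if rng.random() < _STD_NORMAL.cdf(a * (theta - b)) else 0
        record.append((s, b, u))
        given[s] += 1
        correct[s] += u
        # Terminate once this stratum has >= 5 items and <= 20% correct.
        if given[s] >= min_items and correct[s] / given[s] <= chance:
            break
        # Branch up after a correct response, down after an incorrect one,
        # staying in the same stratum at either end of the pool.
        s = min(s + 1, n_strata) if u else max(s - 1, 1)
    return record


# Hypothetical pool: nine strata, 30 items each, placed at the stratum center.
pool = {s: [(s - 5) * 0.65] * 30 for s in range(1, 10)}
rng = random.Random(7)
record = administer(theta=0.5, theta_init=0.0, pool=pool, rng=rng)
print(len(record), record[:5])
```

Only the stratum just visited needs to be checked for termination, because it
is the only one whose counts change after an item is administered.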
The scores. Fifteen scores (ten ability scores and five consistency scores)
were evaluated by Vale & Weiss (1975) and are described in detail in that report.
These scores, which were also examined in this study, were as follows:

Ability Scores 

Score 1. The difficulty of the most difficult item answered correctly.

Score 2. The difficulty of the (K+1)th item, or the next item that
would have been administered had testing continued.

Score 3. The difficulty of the most difficult item answered correctly
at a stratum less difficult than the ceiling stratum (i.e.,
the most difficult item in the most difficult stratum having
a chance proportion or less correct); or, if no real item
existed, the difficulty of a hypothetical item (i.e., the
average difficulty of items that would be in the hypothetical
stratum if it existed) in a hypothetical stratum below
the lowest available stratum.

Scores 4, 5, and 6. The average difficulties of all items in the strata
in which items determining scores 1, 2, and 3, respectively,
are found.

Score 7. The interpolated stratum difficulty, mathematically defined as:

$$\bar{D}_{c-1} + S(P_{c-1} - .50)$$

where $\bar{D}_{c-1}$ is the average difficulty of the items in the
$(c-1)$th stratum, and $c$ is the ceiling stratum. $P_{c-1}$
is the proportion correct at the $(c-1)$th stratum and $S$ is
$\bar{D}_c - \bar{D}_{c-1}$ if $P_{c-1} > .50$ or
$\bar{D}_{c-1} - \bar{D}_{c-2}$ if $P_{c-1} < .50$.

Score 8. The average difficulty of all items answered correctly.

Score 9. The average difficulty of all items answered correctly
between the ceiling stratum and the basal stratum (i.e.,
the most difficult stratum in which all items administered
were answered correctly). Defined in terms of the average
difficulties of the ceiling and (c-1)th strata if the ceiling
and basal strata were adjacent.

Score 10. The average difficulty of items answered correctly in the
(c-1)th stratum, or the difficulty of the hypothetical
stratum immediately below the easiest stratum available if
the testee failed to respond correctly at greater than chance
rate in any stratum.

Consistency Scores 

Score 11. The standard deviation of difficulties of all items
administered.

Score 12. The standard deviation of difficulties of all items
answered correctly.

Score 13. The standard deviation of difficulties of all items answered
correctly between the ceiling and basal strata.

Score 14. The difference in average difficulties of ceiling and
basal strata.

Score 15. The number of strata between the ceiling and basal strata.
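
Several of these scores depend only on the difficulties of the items
administered and the responses to them. As an illustration, the sketch below
computes scores 1, 8, 11, and 12 from a short, made-up response record. The
record format, the function name, and the use of the population (rather than
sample) standard deviation are assumptions, not details taken from the report.

```python
from statistics import mean, pstdev


def stradaptive_scores(record):
    """Compute scores 1, 8, 11, and 12 from (difficulty, response) pairs."""
    administered = [b for b, _ in record]
    correct = [b for b, u in record if u == 1]
    return {
        "score_1": max(correct) if correct else None,      # hardest item correct
        "score_8": mean(correct) if correct else None,     # mean difficulty correct
        "score_11": pstdev(administered),                  # SD, all items given
        "score_12": pstdev(correct) if correct else None,  # SD, items correct
    }


# Hypothetical seven-item record following up/down branching by one stratum.
record = [(0.0, 1), (0.65, 1), (1.30, 0), (0.65, 0),
          (0.0, 1), (0.65, 1), (1.30, 0)]
scores = stradaptive_scores(record)
for name, value in scores.items():
    print(name, round(value, 3))
```

The scores that depend on ceiling and basal strata (for example, scores 3, 7,
9, and 13 through 15) would need additional bookkeeping per stratum and are
not attempted here.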

Fixed-Length Stradaptive

The stradaptive testing strategy was an important addition to the
strategies of adaptive testing because it took a realistic account of the
practical testing situation: items were structured in an efficient manner,
available prior information was used, and a flexible termination rule was
implemented. But some of these practical virtues rendered the strategy very
difficult to evaluate. In live testing, the initial ability estimates inflated
test-criterion correlations somewhat and the flexible termination made
construction of a comparable conventional test difficult. Thus, for research
purposes, it was appropriate to develop a simpler version which would be
easier to evaluate.

Further changes in the strategy were suggested by previous research and
other considerations. The two major changes involved eliminating some of
the scoring strategies and ceasing to use the ceiling and basal strata. As was
discussed earlier, score 8, the average difficulty of all items answered
correctly, was consistently the best of the original ten ability scores in
empirical studies. Thus, only that score was used in the analysis of
Fixed-Length Stradaptive.

All consistency scores which required finding the ceiling or basal strata
were also eliminated from this part of the study. Although conceptually simple,
locating these two strata required some rather complex logic for subjects
whose ability was at the extreme upper or lower end of the item pool. In these
cases, the ceiling and/or basal strata were essentially arbitrarily
determined. Since the simulation study would result in substantial numbers
of testees with extreme ability levels, these scores were not used in this
study, in order to eliminate the effects of such arbitrary decisions.

The logic. The administration logic of Fixed-Length Stradaptive was
identical to that of Variable-Length Stradaptive except for the termination
criterion. A testee was given the first item in one of nine strata chosen on
the basis of the same function of initial ability estimate used for
Variable-Length Stradaptive. Following a correct response he was branched to
the next more difficult stratum and following an incorrect response was
branched to the next easier stratum. This process continued until a
predetermined number of items had been administered (in this study either 10,
20, 40, or 60 items). Flexible termination was not used in this version of the
stradaptive strategy.

The scores. Six scores were calculated for Fixed-Length Stradaptive: three
ability scores and three scores intended to predict errors of measurement. The
first score was score 8 from Variable-Length Stradaptive, the average difficulty
of all items answered correctly. This score was included since it was the best
of the scores in the stradaptive live-testing studies.

The second score was the average difficulty of all items administered.
This was a modification of Variable-Length Stradaptive's score 8, the average
difficulty of all items answered correctly, and has been used for scoring other
types of adaptive strategies (e.g., Lord, 1970; Larkin & Weiss, 1974; Weiss,
1974). Average difficulty of all items administered was investigated because
it is less apt to be affected by erratic response records. For example,
consider the case of a fixed termination rule, such as "stop after 40 items."
A person might begin the test in the easiest stratum and incorrectly answer the
first 31 items. That same person might, by chance, answer the last nine items
correctly, thus progressing to the most difficult stratum. That person would
obtain the same average difficulty correct score as a person who answered 20
items correctly in the fifth stratum and 20 items incorrectly in the sixth
stratum. In these two cases, however, the second testee would have encountered
more difficult items, on the average, and answered more of them correctly.
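
The example can be worked through numerically. In the sketch below, stratum
numbers 1 through 9 stand in for item difficulties, which is a simplifying
assumption made only for illustration; the function names are hypothetical.

```python
from statistics import mean

# Testee A: 31 incorrect answers in stratum 1, then nine correct answers,
# one in each of strata 1 through 9 (40 items in all).
testee_a = [(1, 0)] * 31 + [(s, 1) for s in range(1, 10)]

# Testee B: 20 correct answers in stratum 5 and 20 incorrect in stratum 6.
testee_b = [(5, 1)] * 20 + [(6, 0)] * 20


def average_difficulty_correct(record):
    """Mean difficulty of items answered correctly."""
    return mean(b for b, u in record if u == 1)


def average_difficulty_administered(record):
    """Mean difficulty of all items administered."""
    return mean(b for b, _ in record)


for name, rec in (("A", testee_a), ("B", testee_b)):
    print(name, average_difficulty_correct(rec),
          average_difficulty_administered(rec))
```

Both testees average 5 on items answered correctly, but the averages over all
items administered are 1.9 for the first testee and 5.5 for the second, so
only the second score separates the two records.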

The third score was Owen's (1969) Bayesian scoring technique. This scoring
method was included as a mathematically "optimal" score, for comparison to the
rational scoring methods. Population parameters (i.e., a normal prior ability
distribution with mean of zero and standard deviation of 1.0) were used to
initialize the scoring procedure for all subjects, regardless of their entry
point, and the score was updated after each item response. The FORTRAN IV
subroutine used to calculate this score is included in Appendix A.
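
The Appendix A subroutine is not reproduced here. As a rough sketch of the
kind of update involved, the Python code below applies the exact posterior
mean and variance for a normal prior and a single normal ogive item without
guessing, and then treats the posterior as normal before the next item, which
is the approximation usually associated with Owen's procedure. The absence of
a guessing parameter, the function name, and the example items are
assumptions; the report's subroutine may differ.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def bayes_update(mu, var, a, b, correct):
    """Posterior mean and variance of ability after one item response.

    Assumes a normal prior N(mu, var) and a normal ogive item with
    discrimination a, difficulty b, and no guessing.
    """
    scale = (1.0 / a ** 2 + var) ** 0.5
    d = (b - mu) / scale
    shrink = 1.0 / (1.0 + 1.0 / (a ** 2 * var))
    if correct:
        ratio = _STD_NORMAL.pdf(d) / _STD_NORMAL.cdf(-d)
        new_mu = mu + (var / scale) * ratio
        new_var = var * (1.0 - shrink * ratio * (ratio - d))
    else:
        ratio = _STD_NORMAL.pdf(d) / _STD_NORMAL.cdf(d)
        new_mu = mu - (var / scale) * ratio
        new_var = var * (1.0 - shrink * ratio * (ratio + d))
    return new_mu, new_var


# Start from the population prior N(0, 1) and update after each response.
mu, var = 0.0, 1.0
# Hypothetical (a, b, response) triples for three items.
for a, b, u in [(1.0, 0.0, 1), (1.0, 0.65, 1), (1.0, 1.30, 0)]:
    mu, var = bayes_update(mu, var, a, b, u)
    print(round(mu, 3), round(var ** 0.5, 3))
```

The square root of the final posterior variance corresponds to the kind of
standard error that can be reported alongside the ability estimate.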

The three error predictor scores were included in this study for evaluation
as termination criteria to be used for flexible termination. The first two were
based on the consistency scores investigated by Vale & Weiss (1975). The original
consistency scores considered only variability of the response record and not its
length. Length has been explicitly taken into consideration in the error
predictor scores in a manner analogous to the way that the number of observations
is taken into consideration in calculation of the standard error of a mean.

Score 4 is the standard deviation of the difficulties of all items answered
correctly divided by the square root of the number of items answered correctly.
Score 5 is the standard deviation of the difficulties of all items administered
divided by the square root of the number of items administered. Score 6 is the
standard error provided by Owen's (1969) formula (i.e., the square root of the
Bayesian posterior variance after the last item has been administered).
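
Scores 4 and 5 have the same form as a standard error of a mean, and can be
sketched as follows. The record format, the function name, and the use of the
population standard deviation are assumptions made for illustration.

```python
from math import sqrt
from statistics import pstdev


def error_predictors(record):
    """Return (score 4, score 5) for a list of (difficulty, response) pairs."""
    administered = [b for b, _ in record]
    correct = [b for b, u in record if u == 1]
    # Score 4: SD of difficulties correct / sqrt(number correct).
    score_4 = pstdev(correct) / sqrt(len(correct)) if correct else None
    # Score 5: SD of difficulties administered / sqrt(number administered).
    score_5 = pstdev(administered) / sqrt(len(administered))
    return score_4, score_5


# Hypothetical seven-item response record.
record = [(0.0, 1), (0.65, 1), (1.30, 0), (0.65, 0),
          (0.0, 1), (0.65, 1), (1.30, 0)]
score_4, score_5 = error_predictors(record)
print(round(score_4, 4), round(score_5, 4))

# Repeating the same pattern doubles the length without changing the spread,
# so each predictor shrinks by a factor of sqrt(2).
long_4, long_5 = error_predictors(record * 2)
print(round(long_4, 4), round(long_5, 4))
```

This shows how test length enters the predictors: a longer record with the
same variability yields a smaller predicted error.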

Item Pools

Obtaining item pools for live-testing studies is a long and tedious process
involving writing the items, administering them, norming them, and selecting
those most appropriate to the test being constructed. With the relatively
large item pools required by adaptive testing strategies, it is difficult to
investigate the effects of varying item parameters because a sufficient number
of items is generally not available. In simulation studies, however, acquisition
of item pools is very easy, since once the desired parameters are specified, the
item pool is available. Item pools used in simulation studies do not contain real
items with real content but are rather simply a set of parameters of hypothetical
items.

Certain limitations, stemming from the mathematical model on which item
responses are based, are inherent in simulation studies. Because the items
lack content, assumptions made by some test models (such as local independence)
are explicitly programmed into the response model and the possibility that these
assumptions may be violated by real subjects is ignored. Because of the
simplicity of the response models used, effects of memory and thus test-retest
stability cannot be examined. Simulation studies are not meant to be a
substitute for empirical studies and the simplicity of item pool construction
is not without its costs. But in a simulation study, a variety of item pool
conditions can be readily constructed. Consequently, some questions which
cannot be investigated in live-testing studies, such as the amount of
information provided by a testing procedure, can be investigated.

A Real Item Pool 

For the simulated replication of empirical findings with Variable-Length
Stradaptive, the parameters of the 269 items used for the modified stradaptive
test by Vale & Weiss (1975) were used. These parameters, as well as summary
statistics, are included in Appendix Table B-1. Although the item parameters
used in the present study were those of the modified stradaptive test used in
the live-testing study, the branching procedures used were those of the original
version.

Hypothetical Item Pools 

Although use of parameters obtained from real item pools retains an element
of reality not possessed by use of purely hypothetical item pools, it is not
feasible to manipulate the item parameters of interest by this procedure. To
investigate the effects of item discrimination on test characteristics, three
stradaptive test item pools with normal ogive discrimination indices (a) of
.5, 1.0 and 2.0 were generated. Since the effects of variability in item
discriminations were not investigated in this study, discriminations were held
constant within each of the three item pools.

Item difficulty parameters were generated separately for each item pool.
These parameters were randomly and rectangularly distributed within each of
nine equally spaced strata. Difficulty parameters for each of these three item
pools are included in Appendix Tables B-2, B-3 and B-4.

Also generated were three item pools for the conventional test. These
pools had constant normal ogive discrimination indices of .5, 1.0 and 2.0
and were randomly and rectangularly distributed within the same range of
difficulty as the middle stratum of the stradaptive item pools (i.e., between
b = -.33 and b = +.33). Difficulty parameters for these pools are included in
Appendix Table B-5.
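
The pool construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the program used in the study (see Appendix C): the number of items per stratum, the overall difficulty range of -3.0 to +3.0 (chosen so that the middle of nine equal strata spans -.33 to +.33), the seed, and all function names are hypothetical.

```python
import random


def make_stradaptive_pool(a, items_per_stratum=30, n_strata=9,
                          low=-3.0, high=3.0, rng=None):
    """Return n_strata lists of (a, b) items, easiest stratum first.

    items_per_stratum, low and high are illustrative assumptions.
    """
    rng = rng or random.Random()
    width = (high - low) / n_strata
    pool = []
    for s in range(n_strata):
        lo = low + s * width
        # Difficulties are rectangular (uniform) within the stratum;
        # discrimination is held constant across the whole pool.
        pool.append([(a, rng.uniform(lo, lo + width))
                     for _ in range(items_per_stratum)])
    return pool


def make_conventional_pool(a, n_items=60, low=-1 / 3, high=1 / 3, rng=None):
    """Return (a, b) items peaked on the middle stratum's difficulty range."""
    rng = rng or random.Random()
    return [(a, rng.uniform(low, high)) for _ in range(n_items)]


rng = random.Random(1975)  # arbitrary seed, for reproducibility only
stradaptive_pools = {a: make_stradaptive_pool(a, rng=rng)
                     for a in (0.5, 1.0, 2.0)}
conventional_pools = {a: make_conventional_pool(a, rng=rng)
                      for a in (0.5, 1.0, 2.0)}
```

Because a simulated item is nothing more than its parameter pair, each pool is available as soon as its parameters are drawn.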

Generation of the Data 

Data in this simulation study were obtained in a way very similar to the
way data are collected in a live-testing study. A testing strategy program
chose an item, administered that item, and on the basis of responses to several
items, calculated a score. The difference between this study and a live-testing
study was that in an empirical study, each item is administered to an actual
testee, while in this study items were "administered" to an item response
simulator. The simulator then assessed the item parameters and a testee's
generated ability level and generated a "correct" or "incorrect" response.

The response simulator used a two-step procedure. First, the probability
of a correct response given the "testee's" ability was calculated from the
following equation suggested by Lord (1970):

$$P_g(\theta) = c_g + (1 - c_g)\,\Phi[a_g(\theta - b_g)] \qquad [1]$$

where $P_g(\theta)$ = probability that a testee with ability $\theta$ will
                      correctly answer an item,
      $c_g$ = probability of a correct answer due to guessing (set to
              .20 for this study),
      $\Phi[x]$ = the unit normal distribution integrated from $-\infty$ to
                  the standard deviate, $x$,
      $a_g$ = the discriminating power of the item,
      $\theta$ = the ability level of the testee,
      $b_g$ = the difficulty of the item.

After the probability of a correct response was determined, a random number
was generated from a rectangular distribution between 0 and 1. If this random
number was greater than the probability of answering the item correctly, the
item was considered answered incorrectly; otherwise it was considered correct.
This procedure is identical to that used by Betz & Weiss (1974), and has been
used in a variety of other studies in a slightly different form (e.g., Jensema,
1972; Urry, 1970, 1974).
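
The two-step procedure can be illustrated with a short sketch. It assumes only the response model of Equation 1; the function names and the use of Python's standard library are illustrative choices and do not reproduce the original program.

```python
import random
from statistics import NormalDist

_UNIT_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_correct(theta, a, b, c=0.20):
    """Step 1 (Equation 1): probability of a correct response."""
    return c + (1.0 - c) * _UNIT_NORMAL.cdf(a * (theta - b))


def simulate_response(theta, a, b, c=0.20, rng=random):
    """Step 2: draw a rectangular random number and compare it with P.

    Returns False (incorrect) when the random number exceeds the
    probability of a correct response, True (correct) otherwise.
    """
    u = rng.random()
    return u <= p_correct(theta, a, b, c)
```

For example, a testee whose ability equals the difficulty of the item has a probability of .20 + .80(.50) = .60 of answering correctly, whatever the discrimination of the item.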

Generation of the underlying ability was done in two different ways; this
is discussed below. The computer and computer programs used in this study
are described in Appendix C.



Data Analysis 

Descriptive Statistics 

Means and standard deviations were calculated for all scores of
Variable-Length Stradaptive under all four variations of initial ability
estimates using each of the four item pools. These statistics were, in each
condition, computed from administration of the test to 1000 hypothetical testees
with abilities sampled from a normal distribution with mean of zero and standard
deviation of one. These statistics were computed on Variable-Length Stradaptive
primarily for comparison with empirical data obtained previously.

Means and standard deviations were calculated for all scores of
Fixed-Length Stradaptive to determine how the new scores were distributed, and
to assess the effects of varying characteristics of the test. These statistics
were calculated for the four lengths of 10, 20, 40 and 60 items under all
conditions of initial ability estimates, but using only item parameters
of the three hypothetical pools. As with Variable-Length Stradaptive, these
statistics were computed from administration of the test to 1000 hypothetical
testees with abilities sampled from a normal distribution with mean of zero and
standard deviation of 1.0.






Correlational Analysis 

Inter-score correlations. Correlations among scores were calculated for
both Variable-Length Stradaptive and Fixed-Length Stradaptive. These
intercorrelations were calculated using the initial ability estimate fixed at
θ = 0, on 15,000 hypothetical testees sampled from a normally distributed
population.

Score-ability correlations. In classical test theory (Gulliksen, 1950),
the correlation of ability and test score is referred to as the "index of
reliability". In modern test theory (Lord & Novick, 1968; Urry, 1970) it is
referred to as "validity". In live-testing research it is estimated by taking
the square root of an alternate form reliability coefficient. In simulation
research, it is calculated directly since "ability" is known. Score-ability
correlations are useful if the researcher is interested in assessing how well
test scores predict ability for some specified population as a whole.

These correlations were calculated for all tests under all variations
of conditions. Within each of the conditions, this correlation was calculated
on the same sample of 1000 testees used for the descriptive statistics.

In addition to providing a comparison among scores in the simulation study,
the squared index of reliability for Variable-Length Stradaptive should be
comparable to a parallel forms reliability coefficient, such as the one reported
by Waters (1974, 1975). Because of the effect of memory in a test-retest
design, it should be somewhat less directly comparable to test-retest stabilities
such as those reported by Vale & Weiss (1975).

Information Analyses 

While the correlation between test score and ability is a relevant index
of how well a score predicts ability for a whole population, it provides little
information about how a score predicts ability level within different levels of
that ability. For example, a score-ability correlation for a conventional test
may be higher than a score-ability correlation for an adaptive test, even though
the adaptive test provides a higher precision of measurement at the extremes of
ability, simply because correlations are strongly influenced by the larger
number of observations in the middle range of a normally distributed population.
Since adaptive tests distribute their precision throughout the range of ability,
while precise measurement for peaked conventional tests is concentrated in the
middle range of ability, a score-ability correlation is a statistic biased in
favor of the conventional test, and is not an optimal statistic for comparison
of the two strategies. Thus, evaluation of two testing strategies in terms
of other criteria, which are less influenced by the distribution of ability
in the population, is desirable (see Sympson, 1975, for a discussion of
evaluation criteria).

Information curves. The information provided by a score about an ability
at some level of that ability is roughly analogous to the precision of
measurement at that point, or the ability to discriminate between two
arbitrarily close ability levels centered on that point (Lord, 1970). The graph
of these information values plotted against all values of ability is called the
information curve. In this study, as in Betz & Weiss (1974), information curves
were constructed by calculating information values from a formula suggested
by Birnbaum (1968), at several points along the ability continuum.

Birnbaum's formula is:

$$I_x(\theta) = \left[ \frac{\partial E(x \mid \theta) / \partial\theta}{\sigma_{x \mid \theta}} \right]^2 \qquad [2]$$

where $I_x(\theta)$ is the information about $\theta$ provided by score $x$.

The numerator of Equation 2 may be viewed as a transforming function, converting
the score, x, into ability units. It is also the partial derivative of the
score with respect to ability (θ) evaluated at a particular level of ability,
indicating the relative rate of change of the two variables. The denominator
is simply the conditional standard deviation of the score, or the dispersion
of the score, x, evaluated at a fixed level of ability (i.e., imprecision of
measurement).



To calculate information values, 1000 response records were generated at
each of fifteen equally spaced levels of ability ranging from -3.5 through
0.0 to +3.5. For the middle thirteen points, partial derivatives of the score
means were calculated with respect to ability at each level of generating
ability by taking the derivative of the second degree Lagrangian interpolation
polynomial fitted to three successive points. This technique finds the first
derivative of the second degree polynomial best fitting the point of interest
and the two adjacent points. Because points on each side of the point of interest
were needed to estimate the polynomial, the endpoints (i.e., -3.5 and +3.5) were
not considered in calculating the information values. When the derivatives
were obtained, they were divided by the standard deviation of the scores
at the level of ability on which the derivative was centered and then squared
to yield the information at that point. Connecting the thirteen values of
information thus calculated yielded an information curve.
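
A minimal sketch of this calculation is given below. It assumes that the score means and standard deviations at each of the fifteen ability levels have already been obtained from the simulated response records; the function names are hypothetical. For equally spaced points the derivative of the fitted second degree polynomial at the middle point reduces to the difference between the two neighboring means divided by twice the spacing.

```python
def lagrange_slope_mid(x0, y0, x1, y1, x2, y2):
    """Derivative, at x1, of the quadratic through three points."""
    return (y0 * (x1 - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (2 * x1 - x0 - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (x1 - x0) / ((x2 - x0) * (x2 - x1)))


def information_curve(thetas, score_means, score_sds):
    """Return (theta, information) pairs for the interior ability levels.

    The two endpoints are dropped because a point is needed on each
    side of the level at which the derivative is centered.
    """
    curve = []
    for i in range(1, len(thetas) - 1):
        slope = lagrange_slope_mid(thetas[i - 1], score_means[i - 1],
                                   thetas[i], score_means[i],
                                   thetas[i + 1], score_means[i + 1])
        # Equation 2: (derivative / conditional standard deviation) squared.
        curve.append((thetas[i], (slope / score_sds[i]) ** 2))
    return curve


# Fifteen equally spaced ability levels from -3.5 to +3.5.
thetas = [-3.5 + 0.5 * k for k in range(15)]
```

Passing the fifteen levels together with the corresponding score means and standard deviations yields the thirteen information values that define one curve.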

Statistics descriptive of information curves. Graphs are a simple and
sufficient way to present information curves, but they are difficult to compare
when many information curves are involved. Thus, for economy of presentation,
the means and coefficients of variation of information values at the thirteen
points defining the information curves were computed.

The mean or average information was computed for each of the information 
curves. Mean information is a statistic that is not disproportionately 
weighted by ability distribution characteristics as is the correlation 
coefficient. The higher the mean information, the more information the score 
provides about ability at all levels of ability, on the average. Higher mean 
information implies better measurement. 

The coefficient of variation was also computed for each information
curve. This statistic is of interest because its departure from zero means
that the goal of equiprecision of measurement is not being achieved. This
index is equal to the standard deviation of the thirteen points of the
information curve divided by their mean and multiplied by 100 (see Guilford,
1950, p. 118, for a discussion of the coefficient of variation). It was chosen
over the standard deviation because, unlike the standard deviation, it is not
affected by the absolute magnitude of the information curve. For example, the
height of the information curve of a conventional test composed of equivalent
items is directly proportional to the length of the test and therefore, the mean
and standard deviation of the values of the information curve are similarly
proportional to length. Double the length of a conventional test and the
mean and standard deviation of the information curve both double. The
coefficient of variation, being doubled in both the numerator and the
denominator, would remain constant, however.
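
The invariance of the coefficient of variation under a change in the height of the curve can be checked with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not say which was used.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(info_values):
    """100 * standard deviation / mean of the information values.

    pstdev (population standard deviation) is an assumption.
    """
    return 100.0 * pstdev(info_values) / fmean(info_values)


# Hypothetical information values, and the same curve at double height.
curve = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0 * v for v in curve]
same = abs(coefficient_of_variation(curve)
           - coefficient_of_variation(doubled)) < 1e-9
```

Here `same` is true: doubling every information value doubles both the mean and the standard deviation, leaving their ratio unchanged.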

The choice of the coefficient of variation as an evaluative criterion
involved the value judgment that the relative rather than the absolute variation
was important in evaluating the equiprecision provided by a score on a test.
The information curve of an ideal test (one with measurement of equal precision
throughout the ability range) would have a high mean and a coefficient of
variation equal to zero.

Termination Criterion Analysis 

One goal in the design of the stradaptive strategy was to identify response
records in which the test was not unambiguously locating the testee's ability
level, so that the test could be extended in length to provide better precision
of measurement for that testee. Thus, it was intended that a flexible
termination criterion be used so that different individuals could be administered
different numbers of items. In this study, the three error predictor scores
of Fixed-Length Stradaptive (scores 4, 5 and 6) were examined as possible
candidates for the termination criterion.

To perform these evaluations, 1000 administrations using each of the three
hypothetical stradaptive item pools were randomly terminated after an average
of 30 items; the standard deviation of number of items at termination was 6.
The tests were terminated at various lengths because a termination criterion
must be effective at all lengths, since a real test might terminate at any of
several lengths. The error scores were not allowed to function as termination
criteria in this analysis because an effective termination criterion would hold
the error of measurement constant. Then, with no variability in error of
measurement, there could be no correlation between it and test score.

To provide a criterion of error of measurement to predict, ability scores
and generating ability were first standardized within each set of 1000
administrations to account for differences in scoring metric. Then the unsigned
differences between the standardized ability scores and the standardized
generating ability were calculated. Error predictor scores were then correlated
with these absolute errors, since it was expected that a useful termination
criterion would correlate with these errors. It may be noted that this procedure
is identical to Ghiselli's (1956) procedure for discovery of moderator variables.
The search is for a variable to correlate with absolute deviations from the line
of relations.
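
The steps of this analysis can be sketched as follows. The sketch assumes three parallel lists, one entry per simulated administration; the function names are hypothetical, and standardizing with the population standard deviation is an assumption.

```python
from math import sqrt
from statistics import fmean, pstdev


def standardize(values):
    """Convert values to z scores within one set of administrations."""
    m, s = fmean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def error_predictor_validity(ability_scores, generating_abilities,
                             error_predictor_scores):
    """Correlate an error predictor score with the absolute error."""
    z_score = standardize(ability_scores)
    z_theta = standardize(generating_abilities)
    # Unsigned difference between standardized score and standardized ability.
    abs_errors = [abs(s - t) for s, t in zip(z_score, z_theta)]
    return pearson_r(error_predictor_scores, abs_errors)
```

A useful termination criterion would yield a clearly positive value of `error_predictor_validity`, since larger predicted error should accompany larger absolute error.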
RESULTS

Analysis of the Conventional Test

Descriptive Statistics 

The means and standard deviations of the conventional test scores with
varying test lengths and item discriminations are presented in columns three
and four of Table 1. The mean proportions correct were all close to a value
of .60, which was only slightly higher than the value of .588 obtained by
Vale & Weiss (1975) in their live administration of a 40-item conventional
test. A slight difference in the opposite direction was expected because
the test used in the empirical administration had easier items (mean b = -.368).

Table 1

Summary Statistics for the Conventional Test
as a Function of Item Discrimination and Test Length

                                            Correlation        Information
Discrimination   No. of                        with                  Coefficient
     (a)          Items     Mean     S.D.     Ability      Mean      of Variation

     0.5           10       .612     .209      .703         .725        42.046
                   20       .607     .179      .811        1.448        40.290
                   40       .598     .162      .887        2.882        41.232
                   60       .600     .158      .917        4.307        39.617

     1.0           10       .616     .267      .851        1.771        75.451
                   20       .592     .250      .908        3.198        88.171
                   40       .600     .243      .938        6.444        87.808
                   60       .597     .238      .950        9.595        87.978

     2.0           10       .597     .326      .888        3.484       138.147
                   20       .592     .317      .906        6.601       139.150
                   40       .605     .311      .918       13.630       133.632
                   60       .612     .307      .926       19.674       135.361


This discrepancy may have been due to the fact that the items used in the
live-testing study were normed on relatively small groups of subjects (see
McBride & Weiss, 1974). Waters' (1974, 1975) conventional tests had a mean
proportion correct of .665, which was expected because his items were easier
(mean b = -.368) and normed on larger subject groups. No trend among the score
means across varying test lengths and item discrimination was apparent.

Standard deviations of conventional test scores showed two trends. As
item discriminating power increased, the standard deviations of the scores
increased. As test length was increased, standard deviation decreased.

Correlational Analyses

Column five of Table 1 presents the correlations of conventional test scores
with generating ability. The results show, as expected, that the correlation
increases as the length of the test is increased. A more complex trend was
observed with respect to the item discrimination. The score-ability correlation
increased with increasing item discrimination, and then tapered off. The
correlation between score and ability was a joint function of test length and
item discrimination. For the 10-item test the score-ability correlation
increased with increases in item discrimination. For the 20-item test, the
score-ability correlation improved as discriminations were increased from .5 to
1.0, but remained about the same as discriminations were increased to 2.0. On
40- and 60-item tests, score-ability correlations improved when discriminations
were increased from .5 to 1.0, but were lower for items of 2.0 discrimination.

The attenuation paradox (Loevinger, 1954; Sitgreaves, 1961) is the apparent
reason for the lower score-ability correlations with higher item discriminations.
The attenuation paradox refers to the fact that as items get more discriminating,
they provide more information at a point and less information at abilities
distant from that point. The conventional test had items all of similar
difficulty, and as items became more discriminating, it measured less accurately
for testees with abilities outside of an increasingly narrow range.

Information Analyses 

Column six of Table 1 shows the average information provided by the
conventional test. Average information increased in almost direct proportion
to test length, a result that was expected from modern test theory (Lord &
Novick, 1968). It also increased in almost direct proportion to item
discriminating power. The decrease observed in score-ability correlations as
item discriminations became high (2.0) was not observed with respect to average
information.

Column seven of Table 1 presents the coefficient of variation, an index of
equiprecision of measurement. With discriminations held constant, this index
remained relatively constant across tests of different length, as it was expected
to do. It increased considerably, however, with changing item discriminations,
indicating that the conventional test provides relatively poorer equiprecision
of measurement as items become more discriminating. The trend involved seemed
to indicate that the coefficient of variation is directly proportional to the
item discriminations, doubling when discriminations are doubled. The relation
was not completely linear, however, as the coefficient of variation with
discrimination of 2.0 was less than the value of about 170 that would be
anticipated from a strictly linear relationship.

Analysis of Variable-Length Stradaptive

Descriptive Statistics 

Table 2 presents the means of the fifteen Variable-Length Stradaptive test
scores (scores 1-10 are ability scores, and 11-15 are consistency scores).
Independent variables are 1) the initial ability estimate, fixed at 0.0 and
randomly distributed N(0,1) correlating 0.0, .5 and 1.0 with the generating
ability; and 2) three item pools with discriminations fixed at a=.5, 1.0 and 2.0,
and the real item pool with varying discriminations that was previously used in
an empirical study of the stradaptive test (Vale & Weiss, 1975). Table 2 shows
two trends in the means as a function of item parameters and initial ability
estimates. The means of all fifteen scores became lower as the items became more
discriminating; means of the consistency scores, scores 11 to 15,
decreased with increasingly valid initial ability estimates.

Table 2

Mean Scores for Variable-Length Stradaptive
as a Joint Function of Ability Estimate Validities,
Item Discriminations (a)
and Average Number of Items Administered

                      Fixed     Initial Ability Correlation
Score        a        Entry        0.0        0.5        1.0

    1        0.5      1.580      1.660      1.600      1.540
             1.0      1.200      1.220      1.170      1.170
             2.0       .888       .915       .880       .802
        variable      1.350      1.430      1.320      1.300

    2        0.5       .840       .81?       .793       .818
             1.0       .565       .542       .573       .523
             2.0       .276       .310       .271       .253
        variable       .770       .782       .761       .782

    3        0.5      1.020      1.030      1.020      1.030
             1.0       .732       .691       .707       .691
             2.0       .385       .400       .392       .361
        variable       .875       .908       .869       .858

    4        0.5      1.490      1.560      1.500      1.450
             1.0      1.130      1.180      1.090      1.100
             2.0       .803       .834       .791       .713
        variable      1.270      1.360      1.240      1.230

    5        0.5       .839       .819       .801       .823
             1.0       .559       .541       .569       .315
             2.0       .292       .317       .279       .269
        variable       .793       .802       .785       .803

    6        0.5       .829       .837       .824       .842
             1.0       .522       .480       .497       .476
             2.0       .181       .196       .189       .155
        variable       .695       .729       .685       .676

    7        0.5       .887       .887       .878       .900
             1.0       .577       .537       .550       .534
             2.0       .262       .273       .269       .?36
        variable       .760       .794       .747       .734

    8        0.5       .18?       .192       .156       .167
             1.0      ?.006      -.027      -.019      -.025
             2.0      ?.116      -.131      -.129      -.126
        variable       .126       .127       .108       .114

    9        0.5       .206       .201          ?       .200
             1.0       .053       .033       .043       .013
             2.0      -.048      -.037      -.055      -.062
        variable       .202          ?          ?       .173

   10        0.5       .813       .821       .811       .824
             1.0       .494       .456       .471       .451
             2.0       .163       .181       .178       .135
        variable       .665          ?          ?       .643

   11        0.5       .834       .875       .862       .814
             1.0       .743       .785       .742       .725
             2.0       .673       .713       .672       .625
        variable       .757          ?       .759       .724

   12        0.5       .787       .827       .815       .760
             1.0       .669       .713       .670       .652
             2.0       .575       .607       .575       .529
        variable       .698       .755       .693       .663

   13        0.5       .498       .508       .522       .499
             1.0          ?       .396       .398       .404
             2.0       .268       .272       .275       .271
        variable       .405       .424       .414       .417

   14        0.5      2.550      2.590      2.650      2.580
             1.0      2.?10      2.080      2.080      2.120
             2.0      1.620      1.620      1.630      1.610
        variable      2.140      2.190      2.150      2.170

   15        0.5      2.910      2.970      3.050      2.950
             1.0      2.230      2.180      2.190      2.240
             2.0      1.480      1.480      1.490      1.460
        variable      2.260      2.360      2.300      2.320

Test length
             0.5     54.000     56.100     55.800     55.100
             1.0     40.800     39.600     39.700     40.200
             2.0     28.400     29.400     29.500     27.100
        variable     31.900     32.000     31.200     31.800

Note. A question mark marks a digit, sign or entry that is illegible in the
source copy.

The trend in the means of the ten ability scores to decrease as a function
of item discriminations is at least partly due to the effects of guessing.
Implicit in the up-one, down-one branching strategy used in the stradaptive test
is the goal of converging on items of difficulty such that the testee's
probability of answering correctly is .5. Since all ten Variable-Length
Stradaptive ability scores are some rough monotonic function of this difficulty,
anything that affects that difficulty will monotonically affect the ability
scores.

Difficulty in terms of probability of a correct response, discrimination,
and guessing parameters is obtained by rearranging equation 1:

$$b = \theta - \frac{1}{a}\,\Phi^{-1}\!\left[\frac{P - c}{1 - c}\right] \qquad [3]$$

where $\Phi^{-1}[x]$ is the inverse function of $\Phi[x]$ yielding the
standardized normal deviate when given the cumulative proportion.

The difficulty yielding a probability of being correct of .5, assuming a
guessing probability of .2, is:

$$b = \theta - \frac{1}{a}\,\Phi^{-1}[.375] = \theta + \frac{.319}{a} \qquad [4]$$

in which b is a decreasing function of a. This shows that the optimal difficulty,
and thus the ability scores, should decrease as a function of discrimination when
guessing is possible (specifically with a probability of .2). When guessing is
not possible, equation 3 reduces to:

$$b = \theta \qquad [5]$$

in which b is not a function of a. Therefore, the ability scores are expected to
decrease as a function of item discriminations through their relation with this
"optimal" difficulty level only when guessing is possible.

In addition, some scores (e.g., the difficulty of the most difficult item
correct) would be expected to decrease as a function of increasing item
discrimination even when guessing is not possible due to their joint dependence
on the optimal difficulty level and variability of the response record. For
example, a testee with an inconsistent response record (i.e., one which ranges
across a large number of strata) would be expected to have a more difficult item
correct than a testee having a more consistent response record and the same
average difficulty of items, simply because he encountered a few more difficult
items (along with, of course, a few less difficult items).

The consistency scores were expected to decrease as fewer inappropriate items
were administered in the process of locating the appropriate strata from which to
administer items (i.e., as the initial ability estimates improved and the entry
point became closer to the testee's true ability) and as fewer incorrect
branchings were made (due to more discriminating items). An observation worthy
of note with respect to the decrease in mean consistency scores as a function of
initial ability estimates is that some consistency scores, under some conditions,
were not lower using valid initial ability estimates than they were using a fixed
entry point at the middle stratum (e.g., the between ceiling and basal strata
variability scores, scores 13, 14 or 15) or required a reasonably good initial
ability estimate before they improved (e.g., the overall variability scores, 11
and 12).

Mean scores generated using the real item pool were slightly higher in this
study than in the empirical study by Vale & Weiss (1975) using nearly the same
pool on real subjects (e.g., 1.350 vs 1.073 for score 1; .770 vs .560 for score
2). This may have been due either to inadequacies in the simulation model or
inaccurate item parameters in the live-testing study.

Test lengths showed a marked decreasing trend as item discriminations
improved, but no apparent trend at all with respect to goodness of initial
ability estimates. The average number of items administered using the real item
pool varied between 31.2 and 32.0 items; these means were between those obtained
using the hypothetical pools with discriminations of a=1.0 and a=2.0. This was
not expected because the mean discrimination of the real pool was a=.717. This
was an underestimate of the discriminations of the items actually administered,
however, because the most discriminating items were placed first in the pool;
the corrected average discrimination value in the live-testing study was a=.879.
It appears that putting a few highly discriminating items at the beginning of
each stratum may drastically shorten the stradaptive test when the original
termination criterion is used. The mean length of the identical strategy
in the live-testing study using a slightly different item pool (i.e., one with
a few less items) varied between 27.8 and 31.4 items. Thus, the mean lengths
obtained in this simulation were reasonably close to those obtained with real
subjects.

Table 3 shows standard deviations for the 15 scores on Variable-Length
Stradaptive for combinations of the same independent variables. For both the
ability level and consistency scores there was no apparent trend with respect
to initial ability estimates, and only a very slight tendency for the standard
deviations to decline as item discriminations improved. Score 8, the average
difficulty of all items answered correctly, was the least variable of the
ability scores. Scores 11, 12 and 13, the standard deviation consistency
scores, had low variability both with respect to the ability scores and the
distance variability scores, 14 and 15. The differences among the standard
deviations observed in this study were also apparent in the live-testing study,
although in this study the values were slightly lower. This latter result was
expected, though, since the live-testing study included sources of error not
included in a simulation study.

Table 3

Standard Deviations of Variable-Length Stradaptive Scores
as a Joint Function of Ability Estimate Validities
and Item Discriminations (a)

Score        a

    8        0.5       .888       .918       .944       .992
             1.0       .910       .915       .977      1.010
             2.0       .917       .933       .976      1.010
        variable       .961       .971      1.030      1.070

Correlational Analyses

Table 4 presents the intercorrelations among the Variable-Length Stradaptive
scores. The same four clusters of ability scores observed in the live-testing
study were apparent here. The three item difficulty scores, scores 1, 2 and 3,
formed three two-variable clusters, each with their respective stratum difficulty
scores, scores 4, 5 and 6. Additionally, scores 3 and 6, the highest non-chance
item and stratum scores, formed a tight cluster with scores 7 and 10, the
interpolated stratum difficulty and average highest non-chance item scores.
The fourth ability score cluster was composed of scores 8 and 9, the average
difficulty scores.

Table 4

Intercorrelations of Variable-Length Stradaptive Scores

Clustering among the variability scores was obvious and again consistent with
the live-testing data. Scores 11 and 12, the overall variability scores, formed
one cluster. Scores 13, 14 and 15, the between ceiling and basal strata
consistency scores, formed another.

Table 5 presents the correlations of Variable-Length Stradaptive ability
scores with the ability that generated the responses. The expected increasing
trend in correlation with improving initial ability estimates was not observed
with any regularity across different scores and different parameters, although
the trend was apparent for scores 8 and 9.

Table 5

Score-Ability Correlations for Stradaptive
as a Joint Function of Item Discriminations (a)
and Initial Ability Validities

A definite increasing trend in score-ability correlation with increasing
item discrimination was observed, however. This result suggests that there may
be an inadequacy in the termination criterion, as it should vary the length of
the test to keep constant the accuracy of measurement, and thus the
score-ability correlation.

The parameters from the real item pool provided score-ability correlations
somewhere between the values provided by the hypothetical pools with
discriminations of .5 and 1.0. This result is consistent with their average
discrimination value of about .88.

Scores 8 and 9 consistently had the highest correlation with ability, a
finding consistent with the fact that these two scores showed the highest
stability in the live-testing study by Vale & Weiss (1975) and generally
highest validities and reliabilities in the study by Waters (1974, 1975).

Comparison with the conventional test. Comparison of the score-ability
correlations of Variable-Length Stradaptive with those of the conventional
test (Table 1) is difficult because the flexible termination criterion of
Variable-Length Stradaptive yields test lengths not directly comparable to those
of the conventional test. A rough comparison showed the stradaptive strategy as
better using items with discriminations of 2.0. Using stradaptive score 8, and
comparing the fixed entry point administration (the fair comparison since the
conventional test cannot utilize prior information), the stradaptive test
correlated .950 with an average of 28.4 items while the conventional test
correlated only .918 with 40 items and .926 with 60 items. Thus, with highly
discriminating items, the 28-item stradaptive test correlated higher with
ability than did the 60-item conventional test.

The conventional test achieved higher score-ability correlations than the
stradaptive test with less discriminating items. With discriminations of 1.0
the stradaptive test score 8 correlated only .918 after an average of 40.8 items
while the conventional test correlated .938 after 40 items. At discriminations
of .5, stradaptive score 8 correlated .805 with ability after an average of 54
items and the conventional test correlated .887 after 40 items and .917 after
60 items. Thus, the Variable-Length Stradaptive testing strategy was superior to a
conventional test, with respect to score-ability correlations, only when given
very discriminating items.

Information Analyses 

Table 6 presents the average information values provided by Variable-Length
Stradaptive scores. As with the score-ability correlations, an increasing trend
in information was observed for all scores as item discriminations improved.
The values of average information provided by the items from the real item pool
were, as were the correlations, between those for item pools with discriminations
of .5 and 1.0. The increasing trend observed in the correlations as a function
of validity of initial ability estimates was somewhat more apparent in the
average information data than it was in the correlation data (notably for scores
1, 4, and 9). It also appeared in some conditions of most other scores. Score
9, the average difficulty of all items answered correctly between the ceiling
and basal strata, provided the highest average level of information. Score 8,

Table 6

Average Information Provided by
Variable-Length Stradaptive Scores
as a Joint Function of Item Discriminations (a)
and Initial Ability Validities



Score 



Fixed 
Entry 



Initial Ability 
Correlation . . 



0.0 



0.5 



..1;0 Score 



Fixed 
Entry 



Initial Ability 
■ -Correlation 



UTS 7581 TesT 7765 6 oTF .800 

X.O t.l72 1.208 1.450 1.748 1.0 1.915 

2.0 t .823 1 . 605 2.174 2.520 2.0 3>?13 

variable .926 .^05 1 .232 t.635 variable 1.683 



0.0 



0.5 



1.0 



.757 .800 .815 

1.786 t.971 1,966 

3.930 4.tt6 3.682 

1.550 1.840 t.939 



2 0.5 .757 
1.0 1.439 
2.0 2.286 

variable 4 .302 



. 7 55 . 847 .814 

1.518^- t .4t9 1.718 

2.988 2.545 2.454 

1.251 1.423 1.488 



0.5 .961 .948 1.000 1.01J 

1.0 2.894 2.675 2.835 2.958 

2.0 7.8^5 8.398 8.934 7.956 

variable 2.038' 1 .866 2.286 .2.381 



3 0.5 .811 .780 

1.0 1.811 1.723 

2.0 3.441 3.703 

variable 1 .751 1.608 



e800 .826 8 0.5 1.963 1.749 2.001 2.939 

1.853 1.88d 1.0 4.664 3.397 5.521 7.514 

3.917 3.502 2.0 9.868 6.809 12.391 12.303 

1 .880 1 .971 variable 2.782 2.127 3.134 4.066 



'* 0.5 .518 .507 .649 .787 9. 0.5 1.909 1.974 2.221 2.460 

1.0 .954 .858 1 .341 1 .787 1.0 5.786 5.541 6.024 6.939 

2.0 1.391 1.320 1.658 2.025 2.0 11.864 11.593 13.210 13.617 

variable .675 .745 t . 1 68 t .583 variable 3.409 3 . 1 65 3.976 4. 589 



S 0.5 .819 .767 .900 .861 

1.0 1.587 1.633 1.556 1.794 

2.0 2.604 2.914 2.987 2.797 

variable 1 .385 1.335 1.494 1.528 



10 0.5 .786 .764 .795 .808 

1.0 1.884 1.779 1.918 1.967 

2.0 3.584 3.781. 4.119 3.573 

variable 1 .704 1 . 560 1..863 1 .967 



the average difficulty of all items answered correctly, which had correlated
highest with ability, also provided relatively high levels of information.

Table 7 contains coefficients of variation of the information values for
Variable-Length Stradaptive ability scores. No definite trends with respect to
either item discriminations or quality of initial ability estimates were apparent,
although in 31 out of 42 comparisons a fixed entry point provided more
equiprecise measurement than did a random invalid one (i.e., r=0.0).
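
The coefficient of variation used as the index of equiprecision can be sketched as follows. This is a minimal illustration, assuming the coefficient is the standard deviation of the information values across ability levels divided by their mean and expressed as a percentage; the use of the population standard deviation, the function name, and the example values are assumptions made for illustration and are not taken from the study.

```python
from statistics import mean, pstdev


def coefficient_of_variation(info_values):
    """Return 100 * SD / mean for information values sampled across ability levels.

    Lower values indicate more nearly equal precision at every ability level.
    The population SD (pstdev) is an assumption of this sketch.
    """
    m = mean(info_values)
    if m == 0:
        raise ValueError("mean information is zero; coefficient is undefined")
    return 100.0 * pstdev(info_values) / m


# Hypothetical information curves evaluated at five ability levels.
flat_curve = [2.0, 2.1, 1.9, 2.0, 2.0]     # nearly equiprecise
peaked_curve = [0.5, 1.5, 4.0, 1.5, 0.5]   # precise only near the middle

print(coefficient_of_variation(flat_curve))
print(coefficient_of_variation(peaked_curve))
```

Under this definition a flat information curve gives a coefficient near zero, while a sharply peaked curve gives a large one, which is the sense in which smaller coefficients are read as better equiprecision.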

Comparison with the conventional test. With discriminations of .5 and the
entry point fixed, stradaptive score 9 provided an average information value of
1.909 (Table 6) with an average of 54.0 items (Table 2). With 40 items the
conventional test (Table 1) provided a higher average value of 2.882. When
discriminations were 1.0, stradaptive score 9 provided 5.786 units of information
using 40.8 items on the average. This was still below the average value of 6.444
provided by the conventional test with 40 items. When discriminations were 2.0,
stradaptive score 9 provided 11.864 units of information with 28.4 items, a value
between the values of 6.601 and 13.630 provided by the conventional test with 20
and 40 items, respectively. When Variable-Length Stradaptive was permitted to
function as designed, i.e., with a variable entry point, a moderately valid
initial ability estimate (r=.50) resulted in average information of 13.2 (based
on an average of 29.5 items) for stradaptive, compared to 13.630 for the
longer 40-item conventional test, with discriminations of 2.0.
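
Because the two strategies used different numbers of items, one way to read the figures quoted above is to divide each average information value by the average test length. The sketch below does only that arithmetic. Information per item is not a statistic reported in the source; it is an illustrative normalization, and the variable and function names are hypothetical.

```python
# (average information, average number of items) as quoted in the text,
# keyed by item discrimination a. Stradaptive values are for score 9 with
# a fixed entry point; conventional values are for the 40-item test.
stradaptive_score9 = {0.5: (1.909, 54.0), 1.0: (5.786, 40.8), 2.0: (11.864, 28.4)}
conventional_40 = {0.5: (2.882, 40), 1.0: (6.444, 40), 2.0: (13.630, 40)}


def info_per_item(average_information, n_items):
    """Average information divided by the (average) number of items given."""
    return average_information / n_items


per_item = {
    a: (info_per_item(*stradaptive_score9[a]), info_per_item(*conventional_40[a]))
    for a in stradaptive_score9
}

for a, (strad, conv) in sorted(per_item.items()):
    print(f"a={a}: stradaptive {strad:.3f} per item, conventional {conv:.3f} per item")
```

On this rough per-item basis the stradaptive score falls below the conventional test at discriminations of .5 and 1.0 and exceeds it at 2.0, which matches the direction of the comparison made in the text.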



Table 7

Coefficients of Variation for Information Values of
Variable-Length Stradaptive Scores
as a Joint Function of Item Discriminations (a)
and Initial Ability Validities



Score 



a 



Fixed 
Entry 



Initial Ability 
Correlations 



0.0 



0.5 



1.0 



Score 



Fixed 
Entry 



Initial Ability 
Correlations 



33.48 
1.0 18.80 
2.0 23.67 
variable 64.87 



0.0 



0.5 



1.0 



34.93 47.36 33.64 21.10 

1.0 43.83 59.69 24.17 26.67 

2.0 63.69 64.64 49.60 28.23 

variable 45,47 55.73 44.76 62.09 

0.5 35.22 43.99 39.5,6 63.74 

1.0 24.59 32.28 33.34 30.87 

2.0 32.62 70.96 t7.87 36.72 

variable 63.23 62.90 68.47 69.46 

0.5 30.35 31.88 32.58 29.43 

1-0 18.85 18.21 30.92 23.05 

2.0 22.36 31.66 34.61 15.50 

variable 68.70 56.51 65.66 70.52 

0*5 43.93 65.70 37.97 50.71 

1.0 61 .96 56.98 31.18 72.89 

2.0 67.98 7 6.03 41 .74 32.94 

variable 52,10 65.74 51.76 66.21 

0.5 36.00 43.59 39.80 58.68 

1*0 25.30 35.49 32.47 29.62 

2.0 34.06 35. 12 37.03 33.35 

variable 60.35 61.50 66.30 69.08 



10 



0.5 17.93 

1.0 31.61 

2.0 79.23 103.99 103.13 

variable 61.39 50.92 59.10 



34.98 
19.53 
26.34 
55.82 

24.51 
36.19 



36.05 
34.76 
23.84 
62.68 

38.02 
26.13 



0.5 
1.0 
2.0 
variable 

0.5 
1.0 
2.0 



45.27 
50.56 
57.18 
52.15 

36.36 
43.37 
57.04 



variable 43.38 



0.5 
1.0 
,2.0 
variable 



33.11 
15.01 
25.43 
69.69 



52.03 
50.06 
58.62 
69.13 

50.85 
52.39 
55.19 
39.10 

33.58 
16.83 
34>55 
57.81 



45.85 
52. 19 
73.13 
57.24 

52.99 
46*67 
50.30 
51.41 

35.20 
30.25 
41.13 
65.90 



30.92 
25.21 
20.07 
67.78 

23.29 
26.77 
79.49 
65.06 

40.82 
44.36 
41.25 
59.56 

40.80 
29.81 
44.65 
57.78 

27.89 
20.90 
20.73 
69.85

With a fixed entry into the stradaptive test, the most informative score,
score 9, was more equiprecise than the conventional test score in all comparable
conditions. With discriminations of 2.0 the conventional test score showed a
coefficient of variation more than twice as large (see Table 1) as that of
Variable-Length Stradaptive's score 9 (Table 7). In fact, given equal item discriminations,
only two stradaptive scores, scores 4 and 8, were less equiprecise than scores on
a comparable conventional test and then only for poorly discriminating items.
In general, as item discriminations increased, the relative equiprecision of
the stradaptive test became considerably greater than that of the conventional
test.

Table 8

Mean Scores for Fixed-Length Stradaptive as a
Joint Function of Item Discriminations (a), Test Length,
and Validity of Initial Ability Estimates



Initial Ability M%tUl Ability 



No . Fixed Correlation No . Fixed Correlation 



a 


Items 


Entry 


0.0 


0.5 


1.0 


a 


Items 


Entry 


0.0 


0.5 , 


1.0 








Score 1 










Score 


4 




0.5 


10 


• 085 


.070 


.093 


.0 53 


0.5 


10 


.284 


.306 


.287 


.266 






.186 


.193 


.182 


. 1 77 
.Iff 




20 


■ £40 


. SS6 


. S41 






40 


.257 


.245 


.255 


.261 




40 


.192 


.197 


.192 


.187 




60 


.282 


.272 


.283 


.283 




60 


.163 


.169 


.167 


.160 


1.0 


10 


-.072 


-.083 


-.050 


-.041 


1.0 


10 


.272 


.294 


.270 


.246 




20 


-.003 


-.004 


.009 


.014 




20 


.215 


.232 


.217 


.203 




Art 


£ R 

.0 09 


.033 


.040 








. 1.64 


.171 


. 1 04 


. 1 97 




■■■ 60 


.fJ58 


.044 


.056 


.056 




60 


.136 


.140 


.136 


.133 


2,0 


10 


-.118 


-.130 


-.131 


-.092 


2.0 


10 


.250 


.279 


.251 


.219 




20 


-.091 


-.110 


-.105 


-.082 




20 


.189 


.202 


.188 


.175 




40 


-.083 


-.082 


-.074 


-.069' 








* 1 44 


* 1 37 


.131 




60 


-.061 


-.067 


-.058 


-.072 




60 


.116 


.119 


• 114 


.110 








Score 2 










Score 


5 




0.5 


10 


.369 


.374 


.375 


.325 


0.5 


•10 


• 231 


• 245 


• 232 


.215 




20 


.465 


.484 


.461 


.440 




20 








. • / V 




40 


.536 


.536 


.535 


.527 




40 


.143 


.147 


.144 


• 141 




60 


.559 


.559 


.562 


.556 




60 


.121 


.125 


.123 


• 120 


1.0 


10 


.261 


.263 


.277 


.253 


1.0 


10 


.225 


.240 


.222 


.205 




20 


.331 


.346 


.335 


.319 




20 


.170 


.181 


.170 


.161 








.370 


.364 


.398 






. Iie6 


. 1 30 


. 1 26 


1 A 1 

.121 




60 


.378 


.377 


.378 


.369 




60 


.104 


.106 


.104 


.102 


2.0 


10 


.250 


.263 


.237 


.233 


2.0 


10 


.211 


.233 


.212 


.189 




20 


.266 


.270 


.255 


.251 




20 


.155 


.166 


.155 


.145 




W 


. 270 


.286 


.278 


. 269 






.112 


* 1 17 


tit 

.111 


. 107 




60 


.290 


.295 


.294 


.270 




60 


.094 


' .095 


.092 


.089 








Score 3 










Score 


6 




0.5 


10 


-.001 


-.004 


.005 


-.045 


0.5 


10 


.734 


• 739 


.741 


.740 






— . BBC 


.001 


-.006 


_ MOT 




?A 


. 01 B 


All 


.619 


.01-6 




40 


0 000 


.001 


.000 


-.003 




40 


.478 


.480 


.481 


.482 




60 


.004 


-.006 


-.006 


-.005 




60 


.407 


.409 


.409 


.410 


1.0 


10 


-.023 


-.027 


-.014 


-.060 


1.0 


10 


.518 


.530 


.537 


.545 




20 


-.005 


-.001 


-.007 


-.028 




20 


.392 


.400 


.403 


.406 




40 


.015 


-.001 


-.004 


-.005 




40 


.288 


.289 


.290 


.292 




60 


.004 


-.000 


.003 


-.011 




60 


.237 


.238 


.239 


.239 


2.0 


10 


.012 


.003 


-.030 


-.048 


2.0 


10 


.358 


.377 


.336 


.400 




20 


• 005 


.003 


-.019 


-•033 




20 


.251 


.254 


.259 


.263 




40 


-.008 


.004 


-•007 


-.012 




40 


.172 


.174 


.175 


.176 




60 


.002 


.007 


.007 


-.013 




60 


.139 


.140 


.141 


.141

Analysis of Fixed-Length Stradaptive

Descriptive Statistics

Table 8 presents the means of Fixed-Length Stradaptive scores as a function
of item discrimination, test length, and initial ability estimate. Table 8 shows
that, as with Variable-Length Stradaptive, the means of scores 1 and 2, the average
difficulty scores, tended to decrease as item discriminating power increased.
Score 3, the Bayesian score, showed no such tendency, however. This was probably
because the Bayesian score is not directly related to the "optimal" item difficulty
discussed previously. Means of scores 1 and 2 tended to increase, as the tests
became longer, at all levels of item discrimination. The Bayesian score, score
3, showed no trend with respect to test length. Score 2 means increased as the
validity of initial ability estimates increased, but scores 1 and 3 evidenced
no such trend.

The corresponding error predictor scores, scores 4, 5 and 6, decreased both
as item discriminating power and test length increased. These results were
consistent with the desired characteristics of a score designed to predict
precision of measurement; errors of measurement of test scores should decrease as
both test length and item discriminating power increase.

Means of scores 4 and 5 decreased as the validity of the initial ability
estimate improved, but score 6 showed a slight increase. It is not clear
whether this was because the initial ability estimate has no advantageous effect
on precision of measurement or because the Bayesian error score does not
predict errors of measurement.

Table 9 presents the standard deviations of Fixed-Length Stradaptive's
scores. The Bayesian ability score, score 3, showed increased variability as
item discriminations increased, but no trends with respect to item
discriminations were apparent for scores 1 or 2. The trend in score 3 may be because the
Bayesian procedure used implicitly regresses scores toward the mean and this
regression is less pronounced as items get more discriminating and measurement
becomes more precise.
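
The regression-toward-the-mean argument can be illustrated with a deliberately simplified model. The sketch below assumes a normal prior on ability and a single normally distributed observed score with known error variance; this conjugate normal model is a stand-in chosen for clarity and is not the Bayesian scoring procedure used in the study. All names and numeric values are hypothetical.

```python
import math


def shrinkage_factor(prior_var, error_var):
    """Weight given to the observed score in the posterior mean (0 to 1)."""
    return prior_var / (prior_var + error_var)


def sd_of_posterior_means(prior_var, error_var):
    """SD across testees of the posterior-mean score.

    Observed scores have variance prior_var + error_var, and each posterior
    mean is the prior mean plus shrinkage_factor * (observed - prior mean),
    so the SD of the posterior means is the factor times the observed SD.
    """
    w = shrinkage_factor(prior_var, error_var)
    return w * math.sqrt(prior_var + error_var)


# Smaller error variance stands in for more discriminating items.
for error_var in (1.0, 0.5, 0.1):
    print(error_var, round(sd_of_posterior_means(1.0, error_var), 3))
```

In this simplified setting the spread of the Bayesian scores stays below the prior standard deviation and moves toward it as measurement error shrinks, which is the pattern the paragraph above suggests for score 3.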

All three ability scores showed increasing variability with increasing test
length, when a fixed entry point was used. However, for scores 1 and 2, variability
seemed to be a joint function of number of items, item discriminations, and the
validity of initial ability estimates. For these scores, with items of low
discrimination, variability decreased as the number of items increased. Better
initial ability estimates were associated with decreased variability for the first
two scores, but for the Bayesian ability score variability increased with increasing
numbers of items.

Variability of score 6, the Bayesian error predictor score, increased with
increasing item discriminations. Scores 4 and 5, the average difficulty error
predictor scores, showed only slight decreases with improved discriminations.
All error predictor scores showed decreasing variabilities as test length increased.
Scores 4 and 5 showed decreasing variability as the initial ability estimates
improved, but score 6 showed a slight increase.

Table 9

Standard Deviations of Scores for Fixed-Length Stradaptive
as a Joint Function of Item Discriminations (a)
and Validity of Initial Ability Estimates



Initial Ability , ; Initial Ability 

No. Fixed Correlation Ho. Fixed Correlation 



<* 


T t Aina 




0.0 


O.S 


1 n 


<* 




ontry 


0.0 


0.5 


1 n 
1 * u 








Scor« 


1 










Score 4 




0.5 


10 


.S44 


.949 t.t2e 


1.090 


0.5 


10 


.101 


.144 


.115 


.093 




20 


.678 


.914 


.995 


1.050 




20 


.067 


.078 


.068 


.069 




40 


.882 


.891 


.947 


.988 




40 


.042 


.046 


.045 


.044 




bO 


.894 


.882 


.899 


.951 




60 


.032 


.033 


.034 


.034 


1 f\ 


10 


.096 


.682 


.984 


1 .060 


1*0 


10 


. 095 


.134 


.102 


.004 




^0 


.093 


.911 


.967 


1 .040 




ZO 


.056 


.071 


.060 


.052 




40 


.913 


.906 


.967 


.991 




40 


.035 


.041 


.035 


.033 




60 


.927 


.938 


.951 


.977 




60 


.023 


.026 


.025 


.024 




10 


• 67 1 


.897 


.996 


1 .940 


1 n 
z * 0 


1 n 

10 


.095 


1.538 


.096- 






^0 




.935 


.990 


1.010 




on 
zo 


.056 


.074 


•057 


*0,O3 




40 




.942 


.979 


1.020 




40 


.035 


.040 


.035 


.033 




60 


.952 


.937 


.974 


.983 




60 


.039 


.029 


i024 


.022 








Score 


2 










Score S 




0.5 


10 


.790 


.895 


.976 


1.060 


0*5 


10 


.060 


.076 


.066 


.055 




20 




.673 


.961 


1.020 




20 


.041 


.048 


.044 


.045 




40 


.646 


.849 


.906 


.946 




40 


.026 


.029 


.030 


.030 




60 


.856 


.644 


.866 


n t t 

.91 1 




£n 
bO 


.020 


.021 


.022 


a A 4 

.023 


1*0 


10 


Ann 

.793 


.830 


.959 


1 .050 


1*0 


1 n 
10 


.051 


.069 


.056 


.049 




zo 


.069 


.67 6 


.941 


1 .020 




in 
ZO 


.032 


.042 


.036 


.9 JlC 




40 


.889 


.683 


.945 


.966 




40 


.020 


.024 


.020 


.021 




60 


.901 


.913 


.926 


.957 




60 


.014 


.016 


.815 


.015 


2*0 


10 


.624 


.851 


.966 


I .030 


Z*0 


1 n 
10 


.051 


.074 


.056 


.W4e 




20 


.686 


.899 


.964 


1 .000 




on 
zo 


.029 


.042 


.031 


.Wey 




40 


.921 


.919 


.957 


.996 




40 


.018 


.023 


.020 


.019 




60 


.931 


.916 


.953 


.961 




60 


.032 


.016 


.014 


.014 








Score 


3 










Score 


6 




0*5 


10 


.687 


.662 


.689 


.665 


0*5 


10 


.044 


.048 


.054 


.059 




20 


.791 


.796 


.795 


.792 




20 


.040 


.042 


.046 


.050 




40 


.885 


.664 


.696 


.899 




40 


.029 


.030 


.032 


.033 




60 


.929 


.913 


.662 


.922 




60 


.024 


.024 


.024 


.026 


1.0 


10 


.833 


.824 


.864 


.655 


1*0 


10 


^ff63— 


.079 


.090 


.106 




20 


.906 


.917 


.923 


.937 




20 


.047 


.053 


.057 


.061 




40 


.943 


.946 


.972 


.969 




40 


.029 


.031 


.033 


.034 




60 


.971 


.980 


.966 


.977 




60 


.021 


.022 


.023 


.023 


2.0 


10 


.934 


.928 


.952 


.912 


2*0 


10 


.078 


.101 


.120 


.149 




20 


.949 


.973 


.966 


.960 




20 


.050 


.051 


.059 


.060 




40 


.975 


.962 


.982 


1.000 




40 


.024 


.024 


.025 


.025 




60 


.985 


.970 


.995 


.965 




60 


.017 


.017 


.016 


.016 



Correlational Analyses

Table 10 shows a matrix of intercorrelations among Fixed-Length Stradaptive
test scores. As the table shows, the three ability scores correlated very
highly; the two average difficulty scores, scores 1 and 2, correlated almost
perfectly (i.e., .999) and the Bayesian score correlated .992 and .993 with
them. Scores 4 and 5, the error predictor scores corresponding to the
average difficulty scores, correlated highly with each other (.958) but low with
the Bayesian error predictor score (.351 and .327, respectively). Error
predictor scores 4 and 5 correlated very slightly and negatively with the three
ability scores, but score 6, the Bayesian error predictor score, correlated
moderately with all three ability scores.

Table 10

Intercorrelations Among Fixed-Length Stradaptive Scores

               Ability Scores            Error Predictor Scores
Score        1        2        3         4        5        6

  1        1.000     .999     .992     -.079    -.169     .450
  2         .999    1.000     .993     -.090    -.173     .446
  3         .992     .993    1.000     -.166    -.246     .407
  4        -.079    -.090    -.166     1.000     .958     .351
  5        -.169    -.173    -.246      .956    1.000     .327
  6         .450     .446     .407      .351     .327    1.000

The error predictor scores were designed to be independent of the ability
scores. The fact that the Bayesian error score correlated with the ability
scores suggests that it might not be as useful as the others as a measure of
error. This result may be due to the guessing probability being non-zero in the
model, which may have affected the Bayesian error score.

Table 11 shows the correlations of the three Fixed-Length Stradaptive
ability scores with the generating ability. These correlations increased for
all scores as the item discriminating power and test length increased. For the
average difficulty scores, scores 1 and 2, these correlations increased as the
validity of the initial ability estimate increased. As can be seen, this
trend diminished as test length increased. No definite trend with respect to
initial ability estimates was observed in the Bayesian score. This is because
a constant prior (i.e., population parameters) was given to the Bayesian
scoring routine while the average difficulty scores implicitly incorporated the
initial ability estimate information. The capability to explicitly use prior
information could easily be added to the Bayesian score but this capability is
not without its disadvantages (e.g., the goodness of the prior must be stated
explicitly and this will allow a poor prior to bias the score).



Table 11

[Table 11 could not be recovered from the source scan.]



Correlations of scores 1 and 2 with generating ability were higher when
using a fixed entry point than when using a variable entry point with invalid
prior information. A similar result was observed, in general, for the Bayesian
score. Score-ability correlations for scores 1 and 2 using a variable entry
point were, in general, higher than those using a fixed entry point when the
initial ability estimate correlated .5 with generating ability. When initial
ability estimates correlated 1.0 with actual ability, score-ability correlations
for scores 1 and 2 were always higher than with a fixed entry point. But, in
general, the advantage of prior information diminished as test length increased.
Score 3, the Bayesian score, usually resulted in higher score-ability correlations
when a fixed entry point was used than when prior information was available
regarding a testee's ability level.
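
The validity conditions for the initial ability estimates (correlations of 0.0, .5, and 1.0 with generating ability) can be sketched as follows. This is one standard way to produce estimates with a target correlation, assuming standard normal abilities and normal error; the source does not state how its initial estimates were generated, so the function names, sample size, and seed are hypothetical.

```python
import math
import random


def initial_estimates(abilities, validity, rng):
    """Mix true ability with independent normal noise.

    For standard normal abilities the expected correlation between the
    returned estimates and the abilities equals `validity`.
    """
    noise_weight = math.sqrt(1.0 - validity ** 2)
    return [validity * theta + noise_weight * rng.gauss(0.0, 1.0) for theta in abilities]


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
abilities = [rng.gauss(0.0, 1.0) for _ in range(5000)]
for validity in (0.0, 0.5, 1.0):
    estimates = initial_estimates(abilities, validity, rng)
    print(validity, round(pearson(abilities, estimates), 2))
```

With a validity of 0.0 the entry point is effectively random with respect to ability, which corresponds to the "invalid prior information" condition discussed above.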

Of the two average difficulty scores, score 2, the average difficulty of
all items administered, correlated higher with generating ability than did
score 1 for tests with items having discriminations of 1.0 or 2.0. Score 1, the
average difficulty of all items answered correctly, correlated higher with
ability for tests with items having discriminations of .5. Score 3, the
Bayesian score, correlated higher with ability than either of the average
difficulty scores when no prior information was available, regardless of item
discriminations. In general, score 3 also correlated as high or higher than did
the average difficulty scores for tests having 40 or 60 items or item
discriminations of 2.0 even when prior information was available.
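
The two average difficulty scores described above can be computed from a record of the items a testee received. The sketch below follows the definitions given in the text (score 1: mean difficulty of items answered correctly; score 2: mean difficulty of all items administered). The function name, the data layout, and the choice to return None for score 1 when no item is answered correctly are assumptions for illustration.

```python
def average_difficulty_scores(difficulties, responses):
    """Return (score 1, score 2) for one testee.

    difficulties: difficulty of each administered item, in order.
    responses: 1 (or True) for a correct answer, 0 (or False) otherwise.
    Score 1 is None when no item was answered correctly (an assumption
    of this sketch; the source does not say how that case was handled).
    """
    if not difficulties or len(difficulties) != len(responses):
        raise ValueError("need one response per administered item")
    correct = [b for b, u in zip(difficulties, responses) if u]
    score1 = sum(correct) / len(correct) if correct else None
    score2 = sum(difficulties) / len(difficulties)
    return score1, score2


# Hypothetical four-item record: the hardest item (1.0) is missed.
score1, score2 = average_difficulty_scores([0.0, 0.5, 1.0, 0.5], [1, 1, 0, 1])
print(score1, score2)
```

Because score 2 averages over every item administered, it reflects the branching path through the item pool, whereas score 1 depends only on the items passed.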

Comparison with the conventional test. The fairest general comparison
between Fixed-Length Stradaptive scores and conventional test scores is with no
prior information, since the conventional test cannot use prior information.
The best Fixed-Length Stradaptive score for those conditions was the Bayesian
score.

Comparing Tables 1 and 11, when discriminations were .5, the conventional
test correlated higher with ability for all lengths shorter than 60 items. When
discriminations were 1.0, the stradaptive test correlated higher at all lengths
greater than 10 items. When discriminations were 2.0, the stradaptive test
correlated higher with ability than did the conventional test at all of the
four lengths investigated. For the 10-item tests, the stradaptive score 3
correlation with ability was .919, while that for the conventional test was .888;
at 60 items the respective correlations were .989 and .926.

Although the comparison with the fixed entry point is of most interest in
a pure research situation, it is also appropriate to compare the usual modes
of implementing the two testing strategies, i.e., the conventional test without
prior information and the stradaptive test with a moderately valid prior
ability estimate (i.e., .50). Under these circumstances, score-ability
correlations were higher for the stradaptive test using all scoring methods with
highly discriminating items. Using moderately discriminating items, stradaptive
correlations were higher, in general, for tests longer than ten items. For items
with low (a=.5) discriminations, the conventional test scores correlated higher
with ability.

Information Analysis 

Table 12 presents average heights of information curves for the Fixed-Length

Table 12

Average Information Provided by Revised Stradaptive Scores as a
Joint Function of Item Discriminations (a), Test Length,
and Validity of Initial Ability Estimates











Initial Ability 




Ko. 


Fixed 




Correlatioa 




a 


Iteats 


Entcy 


0.0 


0.5 


1.0 








Score 1 




0.5 


10 


.645 


.943 


.924 


1 .943 




20 




.913 


1.642 


2.442 




40 


8.883 


S.e60 


2.989 


3.708 




60 


: 4. £34 


3.7SI 


4.595 


5.191 


1*0 


10 


t . AAA 


.823 


2.034 


3.990 


in 
ZD 


V . « Viv 


S.S47 


4. 122 


5.825 




40 


ft- 991 


6.049 


8.494 


10.194 




60 


1S.83S 


9.963 


12.721 


15.262 


2*0 


10 


a. AAA 


1.413 


3.530 


8.184 


20 




4.SI0 


7.833 


13.286 




40 


9a _ M7 


11. 70S 


17.836 


23.958 




60 


S9.94S 


18.'7SS 


26.402 


32.924 








. Score 2 




0*5 


10 


.601 




.919 


8.077 




20 


1.368 


.902 


1.647 






40 


S.779 


2.S99 


2.999 


3.70t 




60 




3.809 


4.625 


5.179 


1.0 

X* u 


10 


1.80S 


.763 


2.039 


, 4.349 




20 


3.B6I 


S.319 


4.281 


6.81 1 




40 


B.608 


6.350 


8.937 


10.808 




60 


13.468 


I0.7S1 


13.611 


16.438 


2.0 


10 


4.447 


1.311 


3.540 


9.459 






10. 180 


4.SS7 


8.862 


1 5.414 






33.014 


13.729 


19.574 


88.061 




60 


33.283 


20.7S7 


29.2S4 


37.86d 








Score 3 




0.5 


10 


.769 


.748 


.833 


.647 




20 


I.7S1 


I.S86 


1.719 


t tot 
1 .797 




40 


3.34S 


3.328 


3.348 


3.474 




60 


4.989 


S.0S7 


5.141 


5.279 


1.0 


10 


S.S04 


t.9Sl 


2.444 


2.496 


20 


4.S6S 


4.S79 


5.155 


5.268 




40 


10.43S 


10.065 


10.994 


11.054 




60 


16.643 


16.357 


16.948 


17.059 


2.0 


10 


5.100 


4.091 


4.885 


5.702 


^0 


ie.3S4 


11. 887 


12.271 


13.139 




40 


S9.I51 


S8.698 


30.753 


33.455 




60 


46.746 


44.496 


48.183 . 


50.570

Stradaptive scores. As was observed with the score-ability correlations,
average information increased with item discrimination and test length for all
scoring methods.

Whereas score-ability correlations increased with higher validity of the
initial ability estimate only for scores 1 and 2, average information showed
that increasing trend for all three scores. An advantage of the Bayesian score
with respect to this trend is observed in Table 12. Since, as implemented, the
Bayesian scoring procedure used a constant prior regardless of the testee's
entry point, the effects of poor prior information resulting in an inappropriate
entry point were negligible, whereas the effect on an average difficulty score
was substantial. As an example, the average information from a 60-item
stradaptive test provided by score 2 dropped from 33.283 to 20.727 as a fixed
entry point changed to a random entry point which was uncorrelated with actual
ability. Under the same conditions, the average information provided by the
Bayesian score dropped only slightly from 46.748 to 44.496.
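
The size of the two drops quoted above is easier to compare as a percentage of the fixed-entry value. The sketch below only restates the figures given in the text; the function and variable names are hypothetical.

```python
def relative_drop(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


# 60-item test: fixed entry point versus a random, uncorrelated entry point.
score2_drop = relative_drop(33.283, 20.727)   # average difficulty score (score 2)
bayes_drop = relative_drop(46.748, 44.496)    # Bayesian score (score 3)

print(f"score 2: {score2_drop:.1f}% lower")
print(f"Bayesian score: {bayes_drop:.1f}% lower")
```

Expressed this way, score 2 lost more than a third of its average information under an invalid entry point, while the Bayesian score lost about one twentieth.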

Score 2 generally provided higher levels of information than did score 1
when item discriminations were 1.0 or 2.0. Although score 1 correlated higher
with ability than did score 2 when item discriminations were low (a=.5),
this result did not occur when average information values were compared. As
with score-ability correlations, the Bayesian score provided the highest level
of average information of the three scores when no prior information was
available. Furthermore, it provided the highest level of information when
initial ability estimates correlated .5 with ability, except for a 10-item
test with item discriminations of .5. Even when the average difficulty scores
had prior information correlating 1.0 with ability and the Bayesian score used
no prior information, the latter scoring method provided a higher level of
average information when test length was 60 and the items were moderately or
highly discriminating.

Table 13 presents the coefficients of variation for the height of the
information curves of the three Fixed-Length Stradaptive ability scores. One
trend was apparent in all three scores: equiprecision decreased (and hence the
coefficient of variation increased) as the items became more discriminating.
For example, under the fixed entry point condition, score 1 coefficients
increased from an average of about 34 to an average of about 64 as discriminations
increased from .5 to 2.0.

No consistent trends were apparent with respect to test length, although
equiprecision appeared to improve for all scores as test length increased when
prior information was either not used (i.e., fixed point entry) or was very
poor (i.e., correlation 0.0 with ability). Coefficients of variation for scores
1 and 2 showed a U-shaped quadratic trend with respect to improving initial
ability estimates. This trend is not easily explained, did not generally
appear using the Bayesian score (score 3), and may involve a more complex
interaction with test length. For example, the trend was readily apparent for
score 1 with a length of ten items and discriminations of .5, but flattened out
somewhat, under the same conditions, for score 2. Coefficients of variation
for the Bayesian score, instead of showing this trend, showed a monotonic
decreasing trend with increasing goodness of initial ability estimates within
levels of item discrimination and test length.

                                 Table 13

               Coefficients of Variation for Information
              Functions of Fixed-Length Stradaptive Scores
            as a Joint Function of Item Discriminations (a),
         Test Length, and Validity of Initial Ability Estimates

                                     Initial Ability Correlation
           No.        Fixed
   a      Items       Entry          0.0        0.5        1.0

                                 Score 1

  0.5      10        40.218       36.915     25.681     42.712
           20        36.41?       32.185     21.876     33.565
           40        31.192       27.389     24.808     29.355
           60        29.637       27.206     27.595     29.680
  1.0      10        53.556       60.132     36.237     32.839
           20        58.029       49.366     34.918     37.824
           40        46.274       41.329     35.869     36.683
           60        44.351       39.448     38.758     37.147
  2.0      10        65.532       69.094     47.545     45.275
           20        62.176       57.474     47.305     56.363
           40        62.566       49.879     46.948     55.433
           60        65.565       47.057     51.949     60.603

                                 Score 2

  0.5      10        32.132       36.367     17.789     42.311
           20        26.793       31.372     16.064     31.008
           40        22.350       24.290     16.027     25.036
           60        20.002       22.160     18.417     23.002
  1.0      10        49.193       65.831     28.147     ?2.209
           20        44.836       53.478     26.024     34.334
           40        37.983       40.565     26.867     30.773
           60        35.312       37.427     30.882     29.794
  2.0      10        64.505       82.484     45.660     41.149
           20        59.592       66.010     43.873     52.664
           40        60.093       55.599     41.947     51.329
           60        60.594       51.547     46.137     54.752

                                 Score 3

  0.5      10        26.370       26.023     15.656     12.109
           20        22.003       20.847     11.664      8.255
           40        16.969       17.288     12.651     11.397
           60        14.848       15.880     14.130     13.401
  1.0      10        40.828       42.653     18.994     14.132
           20        33.230       36.038     17.937     10.805
           40        24.806       25.312     17.257     13.561
           60        21.786       22.618     15.961     11.990
  2.0      10        54.820       56.473     28.118     26.204
           20        42.800       45.162     27.990     14.572
           40        38.263       38.061     30.573     17.805
           60        34.114       34.516     27.272     24.948

Comparison with the conventional test. Comparing information provided by
a fixed entry point stradaptive test using the Bayesian score with the information
provided by a conventional test (Table 1), the stradaptive test always provided
a higher average level of information. At no combination of test length and item
discrimination did the information function of the conventional test have a
higher average value than that of score 3 of the stradaptive test. Stradaptive
scores 1 and 2 (with fixed entry) also provided higher average levels of
information than did the conventional test, for all tests with item discriminations
of 1.0 or 2.0. When the stradaptive test utilized valid prior information, its
average level of information exceeded that of the conventional test across all
test lengths, levels of discrimination, and scoring methods.

Equiprecision of measurement provided by the Fixed-Length Stradaptive test
was superior to the equiprecision provided by the conventional test in the
vast majority of comparisons. Equiprecision provided by the Fixed-Length
Stradaptive Bayesian score was superior to that of the conventional test score in all
cases where test lengths and item discriminations were matched. The lowest
coefficient of variation (i.e., the best equiprecision) generated by the
conventional test was 39.617, for a 60-item test with items of .5 discrimination.
The Bayesian score of Fixed-Length Stradaptive provided better equiprecision than
this in all but six out of 48 conditions; in each case where the conventional test
was more equiprecise, the stradaptive test was composed of 20 items or fewer.

Figure 1

Information Curves for 40-Item Conventional and
Stradaptive Tests Using Items with Discriminations of a=0.5

In general, the use of valid prior information within Fixed-Length Stradaptive
resulted in greater equiprecision of measurement. For score 1, the most
equiprecise measurement was observed for initial ability correlations of .5. For
scores 2 and 3, increasingly valid entry point information resulted in more
equiprecise measurement with items of moderate (a=1.0) and high discriminations
(a=2.0), and for low discriminating items (a=.5) for score 3. With a few exceptions,
a fixed entry point was better than an invalid (r=0.0) variable entry point.

Graphic comparison of information curves. Figures 1, 2 and 3 show graphically
the effect of different item parameters and different validities of prior information
on the information curves of the Stradaptive Bayesian score; for comparison
purposes, these figures also include information curves for conventional test scores.
Figure 1 shows information curves based on item discriminations of a=.5 for a
40-item conventional test and for 40-item stradaptive tests with different initial
ability estimate validities. With such low item discriminations, all tests resulted
in very low and flat information curves.

Figure 2

Information Curves for 40-Item Conventional and
Stradaptive Tests Using Items with Discriminations of a=1.0

Figure 2 shows the same curves for tests with items having discriminations of
a=1.0. Several trends are apparent: 1) the conventional test provided better
measurement than the stradaptive tests in the middle of the ability range,
but less precise measurement in the extremes; 2) the stradaptive tests provided
more equiprecise measurement (i.e., flatter curves) than did the conventional test;
and 3) the initial ability estimates had an effect primarily at low ability levels.

Figure 3 shows the curves for tests using very discriminating items
(i.e., a=2.0). The information curves were again higher and the three
observations made for Figure 2 were even more obvious.

Figure 3

Information Curves for 40-Item Conventional and
Stradaptive Tests Using Items with Discriminations of a=2.0

The general upward trend in the curves observed in Figures 2 and 3
results from the .20 guessing parameter used in the simulations. It is interesting
to note, however, that the maximum values of information achieved by the
stradaptive test at the high ability levels are essentially equal to the highest
information values achieved by the conventional test.

Termination Criterion Analysis 

Table 14 shows correlations of the three error predictor scores with absolute
deviations from the line of relations between standardized ability and ability
scores. Error scores and ability scores in Table 14 are ordered such that the
error scores correspond to the ability score on the principal diagonal of the
correlation matrix.
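The kind of validity coefficient reported in Table 14 can be sketched as follows. This is only an illustration: it assumes that the "line of relations" can be approximated by the least-squares regression of the ability score on generating ability, which may differ from the line actually used in the study, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def abs_deviations(ability, score):
    """Absolute residuals of score about its least-squares line on ability.

    Assumption: this line stands in for the report's "line of relations".
    """
    n = len(ability)
    ma = sum(ability) / n
    ms = sum(score) / n
    slope = (sum((a - ma) * (s - ms) for a, s in zip(ability, score))
             / sum((a - ma) ** 2 for a in ability))
    intercept = ms - slope * ma
    return [abs(s - (intercept + slope * a)) for a, s in zip(ability, score)]


def error_score_validity(ability, score, error_score):
    """Correlate an error predictor score with the absolute deviations."""
    return pearson(error_score, abs_deviations(ability, score))
```

An error predictor score that tracked the actual size of each testee's measurement error would yield a coefficient near 1.0 under this sketch; values near zero indicate that the predictor carries little information about precision.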

The data in Table 14 show that: 1) the correlations increased as the item
discriminating powers increased; 2) the error scores were not necessarily most
predictive of the corresponding ability score on the diagonal; and 3) the
correlations were, in general, quite low. When item discriminations were .5, the
correlations were so low as to suggest that no information about precision of
measurement is provided by the error scores. The correlations of .205 and .250
suggest that when item discriminations are 1.0, error scores 4 and 5 might be of
slight utility in predicting precision of measurement. The Bayesian error score,
score 6, was slightly correlated with errors of the two average difficulty ability
scores but essentially unpredictive for errors of the Bayesian ability score,
score 3. When discriminations were 2.0, error scores 4 and 5 were slightly more
predictive of errors in their corresponding ability scores than they were when
discriminations were 1.0. The Bayesian error score was still the least predictive
of the three.

                               Table 14

            Correlations Between Error Predictor Scores and
         Absolute Deviations from the Line of Relations Between
   Ability and Ability Scores, as a Function of Item Discriminations (a)

                 Ability             Error Scores
         a        Score          4         5         6

        0.5         1          -.020     -.013     -.052
                    2          -.018     -.005     -.045
                    3          -.028     -.038     -.004
        1.0         1           .205      .231      .168
                    2           .221      .250      .187
                    3           .013      .007      .062
        2.0         1           .350      .398      .191
                    2           .323      .364      .149
                    3           .247      .249      .179


Using the correlation of error scores with ability score deviations as an
evaluative criterion, it appears that none of the three criteria investigated is
very useful for estimating the magnitude of measurement errors. However, a
different criterion might provide different results. Furthermore, no trait of
inconsistency or unpredictability was programmed into the response model. The
analysis, therefore, was sensitive only to imprecision introduced through
fortuitous response instability affecting the psychometric properties of the tests.
Live testees might be predictably inconsistent and thus demonstrate better validities
for the error scores. A future study might compare shapes of information curves
provided by tests terminated under these criteria as well as the original criterion
suggested by Weiss (1973). The best termination criterion would, in that situation,
be the one which produced the most equiprecise measurement at all levels of
ability. Such a study might also investigate models of a trait of inconsistency
or unpredictability.

SUMMARY AND CONCLUSIONS

Conventional Test

Two psychometric characteristics of the conventional test are worthy of
note. First, the score-ability correlation is not a monotonically increasing
function of item discriminating power. For conventional tests longer than 20
items, the score-ability correlation decreased as items became more discriminating
than a=1.0. The second observation is that equiprecision of measurement with a
conventional test is a monotonic decreasing function of item discrimination. These
two observations combined suggest that if sufficiently discriminating items are
available, the conventional test provides precise measurement for so few people
that improving the quality of the items does not improve the quality of the
measurement.

Variable-Length Stradaptive

The original form of the stradaptive test showed characteristics in this
study similar to its characteristics as derived from previous live-testing
studies. Means and standard deviations of scores obtained using the real item
pool were slightly different in the simulation data. This was probably due to the
use of item parameters based on small subject groups in the live-testing study.
The difference in results may, however, have been a function of the failure of
aspects of the simulation model to adequately reflect the behavior of real
testees.

The length of the Variable-Length Stradaptive test shortened substantially
as items became more discriminating, but showed no trend with improving initial
ability estimates. Tests using the real item pool were shorter than would have
been expected considering the discriminations of the item pool. This supports
the suggestion that, under Weiss' original ceiling stratum termination criterion,
test length may be decreased considerably by putting the most discriminating items
at the beginnings of each of the strata.

The same clusters of scores observed in live-testing studies were observed
in the simulation data; and, as in the live-testing data, the average difficulty
scores, scores 8 and 9, had the highest indices of reliability (i.e., correlations
with generating ability). No definite trend in score-ability correlations was
observed as a function of the quality of the initial ability estimates. This
suggests that variable entry to Variable-Length Stradaptive does not increase its
capability of reflecting true ability levels on the average. An increasing trend
in score-ability correlations with increasing item discriminations was noted.
This suggests that there is a deficiency in the termination criterion, since it does
not keep precision of measurement constant, as it was intended to do.

Variable-Length Stradaptive vs. Conventional

Comparing Variable-Length Stradaptive to the conventional test, the best
stradaptive score, score 8, had higher score-ability correlations only when item
discriminations were higher than a=1.0. The same observation was true for average
information. Better equiprecision of measurement, on the other hand, was always
provided by the stradaptive test, with coefficients of variation for information
functions of the conventional test sometimes being more than twice as large as
those of the stradaptive test operating under the same conditions.

The simple question of which of the two testing strategies is better cannot
be answered without specifying criteria and conditions. The Variable-Length
Stradaptive testing strategy always provided more equiprecise measurement than
the conventional test, but provided more average information and higher
score-ability correlations only when the items were highly discriminating.

Fixed-Length Stradaptive

Intercorrelations among Fixed-Length Stradaptive scores revealed that
scores 1 and 2, respectively the average difficulty of items correct and
administered, were nearly identical, correlating .999. The Bayesian ability
score also correlated highly with the first two (r=.993 and .992). Error
predictor scores 4 and 5, the error scores corresponding to ability scores 1
and 2, correlated highly among themselves, moderately with the Bayesian error
score, and poorly with the ability scores. The Bayesian error score correlated
higher with the ability scores than with the other error scores.

Ability scores 1 and 2, for all practical purposes, performed equally well
in terms of score-ability correlations, average information, and equiprecision.
The Bayesian ability score performed better than the first two scores, in terms
of score-ability correlations and average information, when tests were more
than 20 items long or when initial ability correlations were less than .5. It
did not perform as well in other conditions because it was not explicitly
given prior information when prior information was available and could not use
it implicitly as could the average difficulty scores. Given an informative
prior, the Bayesian score would probably always be superior to the average
difficulty scores. An advantage of the Bayesian score when given only a population
prior is that the effects of poor prior information resulting in an inappropriate
entry point are negligible, whereas the effect on the average difficulty scores
is great. Although no score was best with respect to average information under
the conditions studied, the Bayesian ability score always provided more
equiprecise measurement than did the average difficulty scores.

Fixed-Length Stradaptive vs. Conventional

When compared to the conventional test, the Bayesian score, which was
generally the best score of Fixed-Length Stradaptive, correlated higher with
ability than did the conventional test score when tests were long or
discriminations were high. For item discriminations of a=.5, the stradaptive test's
Bayesian score correlated higher with ability than did the conventional test
score when the test length was 60 items. When item discriminations were 1.0, the
Bayesian score correlated higher when tests were longer than 10 items. When
discriminations were 2.0, the Bayesian score correlated higher with ability
than did the conventional test at all test lengths. The Bayesian score of
Fixed-Length Stradaptive always provided higher average information and better
equiprecision than did the conventional test when item discriminations and test
lengths were equated.

Termination Criteria

Error predictor scores investigated in conjunction with Fixed-Length
Stradaptive appeared to provide little information about errors of measurement,
although slight correlations with absolute error from the line of relations
were observed when item discriminations were high. These data do not lend
support to the idea of using these error scores as termination criteria.
However, an alternative approach for future research on termination criteria might
be to terminate tests on the basis of the various criteria and compare shapes
of the resulting information curves. The best termination criterion would
produce the flattest information curve.

Conclusions

The data support the contention that the stradaptive testing strategy
can produce better measurement than comparable conventional tests in terms of
amount of information provided, equality of information provided at different
ability levels, and, in some conditions, in terms of correlations of scores with
ability. These advantages become even greater as item discriminations improve.
The data further suggest that 1) the Bayesian scoring technique is a very good
method for scoring the stradaptive test; 2) the use of prior information to
provide variable entry points into the fixed-length stradaptive test generally
improves the measurement characteristics of the resulting scores; and 3) further
research is needed to develop and refine flexible termination criteria for the
stradaptive testing strategy.

APPENDIX A

A Fortran IV Bayesian Scoring Routine

      SUBROUTINE BSCOR (BTHET,BVAR,DIF,DIS,IRESP)
C     CALLING PARAMETERS
C
C     BTHET  :  MEAN OF PRIOR ABILITY DISTRIBUTION
C     BVAR   :  VARIANCE OF PRIOR ABILITY DISTRIBUTION
C     DIF    :  B-VALUE OF ITEM
C     DIS    :  A-VALUE OF ITEM
C     IRESP  :  RESPONSE -- 1 = CORRECT, 0 = INCORRECT
C
C---- GUESSING PARAMETER SET TO 0.2 HERE
      GUESP=0.2
      D=(DIF-BTHET)/SQRT(2.0*(1.0/DIS**2+BVAR))
      ERFD=2.0*CDFN(D*1.4142)-1.0
      EDSQ=EXP(D**2)
      EDSQI=1.0/EDSQ
      XKINV=0.5*(1.0-ERFD)
      XLINV=GUESP+(1.0-GUESP)*XKINV
      XL=1.0/XLINV
      IF (IRESP .NE. 1) GO TO 10
C---- CORRECT RESPONSE -- UPDATE MEAN AND VARIANCE
      S=0.398942*(SQRT(BVAR)/SQRT(1.0+(1.0/DIS**2)*
     +1.0/BVAR))*(1.0/XKINV)*EDSQI
      T=1.0-1.772454*D*EDSQ*(1.0-ERFD)
      BTHET=BTHET+(1.0-GUESP)*XKINV*XL*S
      BVAR=BVAR-(1.0-GUESP)*XKINV*XL*S**2*(T-GUESP*XL)
      RETURN
C---- INCORRECT RESPONSE -- UPDATE MEAN AND VARIANCE
   10 BTHET=BTHET-0.797885*(BVAR/SQRT(1.0/DIS**2+
     +BVAR))*EDSQI*(1.0/(1.0+ERFD))
      PART1=1.128379/(1.0+(1.0/DIS)**2*(1.0/BVAR))
      PART2=1.0/(EDSQ*(1.0+ERFD))**2
      PART3=0.564190+D*EDSQ*(1.0+ERFD)
      BVAR=BVAR*(1.0-PART1*PART2*PART3)
      RETURN
      END
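For readers who want to experiment with the update, the routine above can be sketched in Python. This is an illustrative translation, not the authors' code: it assumes the routine implements the usual normal-prior update for a normal-ogive item with guessing, it is written in terms of the standard normal density and distribution function rather than the error function, and the function names and the default guessing value of 0.2 (the value the text reports for the simulations) are choices made here.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bayes_update(mean, var, b, a, correct, guess=0.2):
    """Update a normal ability prior (mean, var) after one item response.

    b is the item difficulty, a the item discrimination, and correct is
    True for a right answer. Returns the new (mean, variance).
    """
    scale = math.sqrt(1.0 / a ** 2 + var)
    d = (b - mean) / scale          # standardized distance of item from prior mean
    k = var / scale
    if correct:
        denom = guess + (1.0 - guess) * normal_cdf(-d)
        ratio = (1.0 - guess) * normal_pdf(d) / denom
        return mean + k * ratio, var - k * k * ratio * (ratio - d)
    ratio = normal_pdf(d) / normal_cdf(d)
    return mean - k * ratio, var - k * k * ratio * (ratio + d)


# Hypothetical three-item sequence starting from a standard normal prior.
estimate, variance = 0.0, 1.0
for difficulty, answered_correctly in [(0.0, True), (0.7, False), (0.3, True)]:
    estimate, variance = bayes_update(estimate, variance, difficulty, 1.0,
                                      answered_correctly)
    print(round(estimate, 3), round(variance, 3))
```

Because guessing makes a correct answer less informative than an incorrect one, the sketch moves the estimate less, and shrinks the variance less, after a correct response than after an incorrect response to the same item.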






Table B-2

Difficulties (b), by Stratum, of the Hypothetical Stradaptive
Item Pool Having Item Discriminations of a=.5



Stratum 



1 

-8.695 

-8.663 

-2.497 

-8.338 

-2.377 

-2.520 

-2.5S6 

-2.969 

-2.709 

-2.'656 

-2.993 

-2.447 

-2.350 

-2.607 

-2.372 

-2.721 

-2.640 

-2.369 

-2.735 

-2.472 

•2.909 

-2.61 1 

-2*742 

-2.369 

-2.467 

-2.666 

-2.393 

-2.413 

-2.714 

•e.531 

-2.359 

-2.603 

-2.920 

-2.679 

-2.636 

-2.940 

-2.435 

-2.366 

-2.561 

-2.754 

-2.599 

-2.336 

-2i799 

-2.714 

-2.742 



2 . 
1 .761 
1.792 
2.271 
1.929 
1.907 
1.756 
2.115 
2.029 
2.306 
2.160 
1.S95 
S.234 
2.307 
1.664 
2.263 
1.673 
1.909 
2.277 
1.936 
2.267 
1.764 
1.969 
2.201 
2.136 
1.932 
2.142 
1.730 
2.090 
1.661 
2.064 
1.867 
2.146 
1.703 
1.756 
1.941 
2.311 
1.996 
1.S67 
1.952 
1 .909 
1.692 
1.766 
1.901 
2.261 
2.061 
2.129 



1.446 

1 .373 

1.152 

1.271 

1 .470 

I .411 

1.329 

1.361 

1.637 

1 .426 

1.321 

1.140 

1.266 

1.425 

1.127 

1.209 

1.064 

1.401 

1.407 

1.566 

1.024 

1.357 

1.502 

1.615 

1.109 

1.291 

1.411 

1.292 

1.337 

1.206 

1.135 

1.279 

1.051 

1.099 

1.333 

1.575 

1.566 

1.632 

1.345 

1.520 

1.570 

1.465 

1.171 

1.115 

1.056 

1 .317 



4 

.559 

.663 

.635 

.696 

.506 

.366 

.333 

.569 

.393 

.663 

.773 

.680 

.666 

.774 

.616 

.560 

.646 

.791 

.460 

.660 

.343 

.912 

.377 

^970 

.647 

.467 

.566 

.629 

.669 

.796 

.356 

.456 

.397 

.563 

.421 

.527 

.513 

.433 

.7 57 

.365 

.566 

.407 

.695 

.643 

.556 

.466 



5 

-"7119 
.265 
.249 

-.112 
.221 

-.123 

-.2S6 
.171 

-.036 
.026 
.227 
.020 

-.234 
.072 

-.166 
.097 
.291 
.014 

-.207 

-.259 
.216 

-.146 
.164 
.271 
.207 
.250 
.266 

-.292 
.256 

-.117 
.100 
.033 

-.269 
.074 
.327 

•.076 

•.114 
.161 
.265 
.042 
.122 
.144 

•.025 

•.274 

•.240 
.025 



6 

.6'63 

.636 

.617 

.513 

.451 

.444 

.584 

.360 

.764 

.740 

.739 

.603 

.929 

.367 

.412 

.512 

.946 

.726 

.944 

.777 

.620 

.415 

.995 

.726 

.546 

.795 

.564 

.365 

.796 

.526 

.991 

.356 

.696 

.94S 

.937 

.706 

.654 

.721 

.4£;i 

.741 
.664 
.515 
.723 
.462 
.997 
.000 



7 

1.337 

1.2S2 

1.354 

1.466 

1.562 

1.647 

1.115 

1.461 

1.024 

1.403 

1.061 

1.165 

1.640 

1.179 

1.193 

1.199 

1.453 

1.636 

1.511 

1.563 

1.121 

1.454 

1.355 

1.659 

1.305 

1.292 

1.364 

1.425 

1.344 

1.151 

1.596 

1.542 

1.239 

1.030 

1.336 

1.349 

1.649 

1.590 

1.264 

1.507 

1.256 

1.511 

1.376 

1.355 

1.377 

1.452 



S. 

1.696 

1.769 

2.102 

1.664 

.1.693 

1.766 

1.929 

2.146 

2.054 

2.165 

1.986 

1.621 

2.310 

2.310 

2.273 

2.229 

1.791 

2.163 

2.120 

2.029 

1.637 

1 .744 

2.196 

1.693 

1.779 

1 .722 

1.636 

1.619 

2.266 

2.096 

1.'705 

1.807 

2.019 

2.202 

1.604 

I .8.59 

2.316 

2.i334 

2.133 

1.974 

1.639 

2.264 

2.022 

2.151 

1.928 

2.221 



2.967 

2.366 

2.941 

2.379 

2.637 

2.575 

2.642 

2.670 

2.672 

2.511 

2.573 

2.754 

2.332 

2.645 

2.735 

2.701 

2.356 

2.906 

S.457 

2.71 1 

2.946 

2.696 

2.409 

2.764 

2.765 

2.665 

2.59.1. 

2.919 

2.476 

2.601 

2.334 

2.447 

2.753 

2.444 

2.976 

2.706 

2.925 

2.466 

2.511 

2.362 

2.992 

2.925 

2.945 

2.749 

2.955 

2.422 



Mean 
S.D. 



•2.607 
.190 



•2.000 
.192 



•1.332 
.171 



.614 
>1S6 



.032 
.190 



.663 
.200 



1.374 
.176 



1.999 
.190 



2.661 
.213

Table B-4

Difficulties (b), by Stratum, of the Hypothetical
Stradaptive Item Pool Having Item Discriminations of a=2.0



-2.610 

-2.953 

-2.773 

-2.696 

-2.573 

-2.993 

-2.630 

-2.916 

-2.691 

-2.666 

-2.597 

-2.653 

-2.496 

-2.633 

-2.406 

-2.577 

-2.656 

-2.904 

-2.995 

-2.691 

-2.502 

-2.357 

-2.936 

-2.992 

-2.704 

-2.334 

-2.697 

-2.612 

-2.593 

-2.635 

-2.679 

-2.621 

-2.637 

-2.519 

-2.663 

-2.346 

-2.767 

-2.726 

-2.907 

-2.733 

•2.363 

-2.767 

-2.952 

-2.637 

•2.354 

•2.701 



1.760 
8.054 
1 .974 
2.224 
1.744 
1.739 
2.324 
1.656 
2.293 
1 .622 
2.235 
2.222 
2.067 
2.303 
1.706 
1.961 
1.737 
1.642 
2.077 
1.602 
1.971 
1 .776 
2.006 
1.666 
2.102 
2.113 
2.169 
2.303 
2.127 
2.175 
1.691 
2.313 
2.046 
1.965 
1.644 
1 .716 
I .664 
2.153 
2.026 
1.651 
2.246 
'2.066 
1.769 
S.050 
1.652 
1.712 



-1.469 

-I .097 

-1.092 

-1.425 

-1.155 

-1.077 

-1.076 

-1.456 

- 1 . 604 

-1.325 

-1.496 

-1.426 

-1.496 

-1.161 

-1 .027 

-1.334 

-1.492 

-I .306 

-I .652 

-1.160 

-1.590 

-1.036 

-1.390 

-i.624 

-1.132 

-1.039 

-1.560 

-I .346 

-1.437 

-1.306 

-1.504 

-1.006 

-1.165 

-1.164 

-1 .221 

-1.265 

-1.036 

-1.329 

-1.372 

-1.461 

-1.650 

-1.063 

-1.295 

-1.596 

-1.046 

-1.104 



.526 
.630 
.420 
.409 
.526 
.780 
.596 

.621 

.377 

.951 

.960 

.918 

.752 

.472 

.643 

.702 

.961 

.665 

.914 

.641 

.392 

.919 

.504 

.666 

.996 

.503 

.952 

.646 

.766 

.779 

.919 

.715 

.426 

.405 

.404 

.936 

.594 

.722 

.747 

.647 

.613 

.391 

.609 

.442 

.465 

.750 



Stratum 
5 

-.129 
.137 
-.321 
.303 
-.230 
-.279 
-.267 
-.041 
-.187 
-.055 
.260 
-.075 
.iB01 
.146 
-.057 
.200 
-.327 
-.330 
.099 
.125 
.149 
-.166 
.070 
-.301 
;U0 
-.027 
•.144 
.242 
.215 
.094 
.115 
-.225 
-.196 
.194 
.169 
.277 
.196 
.273 
.076 
-.232 
.076 
-.329 
-.219 
.229 
.099 
-.291 



.421 

.747 

.453 

.955 

.899 

.864 

.670 

.726 

.667 

.793 

.531 

.644 

.453 

.533 

.705 

.709 

.669 

.755 

.524 

.921 

.593 

.766 

.657 

.608 

.967 

.633 

.434 

.443 

.649 

.959 

.525 

.424 

.981 

.427 

.930 

.551 

.622 

.377 

,979 

.626 

.912 

.606 

.672 

.576 

.511 

.361 



1'. 546 

1.570 

1.246 

1.490 

1.113 

1.530 

1.391 

1.272 

1.665 

1.637 

1.514 

1.836 

1.190 

1.220 

I .615 

1.611 

1.030 

1.620 

1.026 

1.496 

1.184 

1.496 

1.085 

1 . 196 

1.652 

1.519 

1.124 

1.376 

1.093 

1.019 

t.141 

1.368 

1.519 

1.335 

1. 1 36 

1.171 

1.569 

1.127 

1.114 

1.455 

1.303 

1.066.^ 

1.069 

1.552 

1.332 

1.152 



8 

1.999 

2.069 

1.999 

2.314 

1.806 

2.161 

2.003 

2.169 

1.643 

2.315 

1.905 

1.797 

2.105 

1.671 

1.703 

1.796 

1.935 

2.149 

2.056 

2.167 

1.991 

1.629 

2.310 

2.3T1 

2.274 

2.230 

1.799 

2.166 

2.124 

2.034 

U845 

I .752 

2.200 

1.699 

t .767 

1.731 

1.644 

1.626 

2.267 

2.102 

1.714 

1.615 

8.023 

2.204 

1.612 

1.666 



2.630 

2.696 

2.334 

2.666 

2.693 

2.966 

2.450 

2.537 

2.359 

2.605 

2.503 

2.512 

2.967 

2.413 

2.493 

2.591 

2.507 

2.460 

2.421 

2.775 

2.969 

2.946 

2.955 

2.393 

2.403 

2.535 

2.757 

2.767 

2.760 

2.562 

2.496 

2.615 

2.719 

2.356 

2.642 

2.547 

2.950 

2.595 

2.455 

2.564 

2.763 

2.600 

2.676 

2.726 

2.666 

2.376 



Mean 

S.D. 



•2.710 
.192 



1.990 
.196 



-1.306 
.196 



•.664 
.196 



•.012 
.202 



.676 
.163 



1.337 
.206 



8.000 
.169 



2.637 
.193

APPENDIX C

Computer Hardware and Software Used for the Simulations

The Computer

The computer used for this research was a Hewlett-Packard 9600E real-time
computer system, which is based on a 2100S central processor with a memory
consisting of 32K of 16-bit words. Peripheral equipment available consisted
of one disk with a capacity of about 5 million ASCII characters, a high-speed
paper tape reader, a teletype and four cathode ray terminals (CRTs).

Integer numbers were represented in a single computer word with a maximum
value of ±32768. Real numbers were represented as two 16-bit words having
about 6½ significant digits. For long addition where this was not sufficient
precision, double-precision arithmetic was used, providing 13-digit significance
by using three 16-bit words.

Approximately half of the total memory was used by the computer operating
system, but the half remaining proved adequate for all simulation programs used.
The disk had room to store test scores for 15,000 testees after space was
provided for system programs and simulation programs.

The Program System 

To make efficient use of available computer time, it was necessary to run
the simulation programs from 5:00 each evening until 8:00 the next morning as
well as from 5:00 Friday evening until 8:00 Monday morning. Since the
simulation program could fill its available storage space in an hour or so, it was
obvious that the data would have to be generated, analyzed, erased, and
regenerated. To make this process semi-automatic, so that a programmer would not
have to be present to start a new program each hour, a program system organized
as shown in Figure C-1 was constructed.

Scheduling. The first step in running the simulation system was to write
a schedule of programs and enter this through the teletype. This schedule was
then read by the scheduling program, which in turn scheduled a certain processing
program to run. The schedule might have been, for example, as follows: 1)
generate and score 1000 conventional test response records using item parameter
file 1 and a normal ability distribution; 2) run a correlation program to
correlate the generating ability with the test score; 3) print the results;
4) generate 15,000 test records using a rectangular ability distribution; 5)
calculate the information values; 6) print the results; and 7) stop.

The scheduling program would read the first element of the schedule and,
seeing that it said to generate data, would schedule the data generating
program 1000 times, etc. After each program finished, control was returned to
the scheduling program, which then scheduled another process or stopped.
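The control flow described above can be sketched as a small dispatch loop. This is a hypothetical modern illustration rather than a reconstruction of the original system: the schedule format (a list of program-name and repetition-count pairs), the "stop" keyword, and the function name are all assumptions.

```python
def run_schedule(schedule, programs):
    """Run each scheduled step in order and return a log of program results.

    schedule: list of (program_name, repetitions) pairs; a step named
              "stop" ends the run (hypothetical format).
    programs: dict mapping program names to zero-argument callables.
    """
    log = []
    for name, repetitions in schedule:
        if name == "stop":
            break
        program = programs[name]
        for _ in range(repetitions):
            # Control returns to the scheduler after every program run.
            log.append(program())
    return log


# Hypothetical stand-ins for the processing programs.
demo_programs = {
    "generate": lambda: "record generated",
    "correlate": lambda: "correlation computed",
    "print": lambda: "results printed",
}
demo_schedule = [("generate", 3), ("correlate", 1), ("print", 1), ("stop", 0)]
demo_log = run_schedule(demo_schedule, demo_programs)
```

Keeping the schedule as data, separate from the programs it names, is what allows an overnight run to proceed through generation, analysis, and printing without an operator present.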

The data generation program. The core of this system was the data
generation program. On orders from the scheduling program, it first selected an
item parameter file to work from and then generated an ability either randomly
from a normal distribution or at a fixed level as dictated by the schedule.
If the schedule called for a stradaptive test, an initial ability estimate was
generated which was either fixed at the mean of the ability distribution (so
everyone entered at the middle stratum) or sampled from a normal distribution
about the mean of the ability distribution with a varying degree of correlation
with the generating ability. A test protocol was then generated and scored
with length being controlled either by the standard termination criterion for
Variable-Length Stradaptive, or a schedule-dictated termination criterion for
Fixed-Length Stradaptive. The scores from this test were then written on the
data file and the procedure was repeated until a sufficient number of testees
had been run.
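One common way to obtain an initial ability estimate with a chosen correlation to the generating ability is to mix the standardized ability with independent normal noise. The sketch below uses that construction as an assumption, since the report does not give the formula that was actually used; the function name and parameters are hypothetical.

```python
import math
import random


def generate_testee(rng, rho=None, mean=0.0, sd=1.0):
    """Return (ability, initial_estimate) for one simulated testee.

    rho=None reproduces fixed entry: the estimate is the population mean.
    Otherwise the estimate is normal about the mean and correlates rho
    with ability in the population (assumed construction).
    """
    ability = rng.gauss(mean, sd)
    if rho is None:
        return ability, mean
    z = (ability - mean) / sd
    noise = rng.gauss(0.0, 1.0)
    estimate = mean + sd * (rho * z + math.sqrt(1.0 - rho ** 2) * noise)
    return ability, estimate


rng = random.Random(1975)
sample = [generate_testee(rng, rho=0.5) for _ in range(20000)]
```

With rho equal to 1.0 the estimate equals the generating ability, and with rho equal to 0.0 it is pure noise about the mean, which corresponds to the "invalid" variable entry condition discussed in the results.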

Figure C-1

Schematic Representation of the
Simulation Program System (Arrows
Represent Flow of Information)




The data analysis programs. The data analysis programs read from the data
file generated by the data generation program, calculated appropriate
statistics, and printed the results on the teletype. Following this, the data file
was free to be written upon again. All statistical analyses were programmed
specifically for this research, using common formulas for descriptive
statistics and correlations, and formulas described in the Data Analysis section for
informational statistics.